Abstract

Training a Support Vector Machine (SVM) requires the solution of a 
quadratic programming problem (QP) whose computational complexity 
becomes prohibitively expensive for large scale datasets. Traditional op- 
timization methods cannot be directly applied in these cases, mainly due 
to memory restrictions. 

By adopting a slightly different objective function and under mild 
conditions on the kernel used within the model, efficient algorithms to 
train SVMs have been devised under the name of Core Vector Machines 
(CVMs). This framework exploits the equivalence of the resulting learning
problem with the task of building a Minimal Enclosing Ball (MEB)
in a feature space, where data is implicitly embedded by a kernel
function.

In this paper, we improve on the CVM approach by proposing two
novel methods to build SVMs based on the Frank-Wolfe algorithm,
recently revisited as a fast method to approximate the solution of a MEB
problem. In contrast to CVMs, our algorithms do not require computing
the solutions of a sequence of increasingly complex QPs and are defined
by using only analytic optimization steps. Experiments on a large
collection of datasets show that our methods scale better than CVMs in most
cases, sometimes at the price of a slightly lower accuracy. Like CVMs, the
proposed methods can be easily extended to machine learning problems
other than binary classification. Unlike CVMs, however, they also yield
effective classifiers with kernels which do not satisfy the condition
required by CVMs, and can thus be used for a wider set of problems.



1 Introduction 



Support Vector Machines (SVMs) are currently one of the most effective
methods to approach classification and other machine learning problems,
improving on more traditional techniques like decision trees and neural
networks in a number of applications [TSl [33]. SVMs are defined by
optimizing a regularized risk functional on the training data, which in most
cases leads to classifiers with an outstanding generalization performance
[39, 33]. This optimization problem is usually formulated as a large convex
quadratic programming problem (QP), for which a naive implementation
requires $O(m^2)$ space and $O(m^3)$ time in the number of examples $m$,
complexities that are prohibitively expensive for large scale problems
[33, 37]. Major research efforts have hence been directed towards
scaling up SVM algorithms to large datasets.

Due to the typically dense structure of the Hessian matrices involved in the
QP, traditional optimization methods cannot be directly applied to train an
SVM on large datasets. The problem is usually addressed using an active set
method where at each iteration only a small number of variables are allowed to
change [35] [TSl [30]. In non-linear SVM problems, this is essentially
equivalent to selecting a subset of training examples called a working set
[39]. The most prominent example in this category of methods is Sequential
Minimal Optimization (SMO), where only two variables are selected for
optimization each time [SI [30]. The main disadvantage of these methods is
that they generally exhibit a slow local rate of convergence, that is, the
closer one gets to a solution, the more slowly one approaches that solution.
Moreover, performance results are in practice very sensitive to the size of
the active set, the way the active variables are selected, and other
implementation details like the caching strategy used to avoid repetitive
computations of the kernel function on which the model is based [35]. Other
attempts to scale up SVM methods consist in adapting interior point methods
to some classes of the SVM QP. For large-scale problems, however, the
resulting rank of the kernel matrix can still be too high to be handled
efficiently |37]. The reformulation of the SVM objective function as in 5T2j, the 
use of sampling methods to reduce the number of variables in the problem as in 
and [5n], and the combination of small SVMs using ensemble methods as 
in [5P have also been explored. 

Looking for more efficient methods, a new approach was proposed in [37]:
the task of learning the classifier from data can be transformed to the problem
of computing a minimal enclosing ball (MEB), that is, the ball of smallest
radius containing a set of points. This equivalence is obtained by adopting a
slightly different penalty term in the objective function and imposing some mild
conditions on the kernel used by the SVM. Recent advances in computational
geometry have demonstrated that there are algorithms capable of approximating
a MEB with any degree of accuracy $\epsilon$ in $O(1/\epsilon)$ iterations,
independently of the number of points and the dimensionality of the space in
which the ball is built [37]. Adopting one of these algorithms, Tsang and
colleagues devised in [37] the Core Vector Machine (CVM), demonstrating that
the new method compares favorably with most traditional SVM software,
including for example software based on SMO [gHSU].

CVMs start by solving the optimization problem on a small subset of data 
and then proceed iteratively. At each iteration the algorithm looks for a point 
outside the approximation of the MEB obtained so far. If this point exists, it 
is added to the previous subset of data to define a larger optimization problem, 
which is solved to obtain a new approximation to the MEB. The process is re- 
peated until no points outside the current approximating ball are found within 
a prescribed tolerance. CVMs hence need the resolution of a sequence of opti- 
mization problems of increasing complexity using an external numerical solver. 
In order to be efficient, the solver should be able to solve each problem from 
a warm-start and to avoid the full storage of the corresponding Gram matrix. 
Experiments in [37] employ to this end a variant of the second-order SMO
proposed in [8]. 

In this paper, we study two novel algorithms that exploit the formalism of 
CVMs but do not need the resolution of a sequence of QPs. These algorithms 
are based on the Frank-Wolfe (FW) optimization framework, introduced in [11]
and recently studied in jH] and U as a method to approximate the solution 
of the MEB problem and other convex optimization problems defined on the 
unit simplex. Both algorithms can be used to obtain a solution arbitrarily close 
to the optimum, but at the same time are considerably simpler than CVMs. 
The key idea is to replace the nested optimization problem to be solved at each 
iteration of the CVM approach by a linearization of the objective function at the 
current feasible solution and an exact line search in the direction obtained from 
the linearization. Consequently, each iteration becomes considerably cheaper than a
CVM iteration and does not require any external numerical solver. 

Similar to CVMs, both algorithms incrementally discover the examples which 
become support vectors in the SVM model, looking for the optimal set of weights 
in the process. However, the second of the proposed algorithms is also endowed 
with the ability to explicitly remove examples from the working set used at each 
iteration of the procedure and has thus the potential to compute smaller models. 
On the theoretical side, both algorithms are guaranteed to succeed in $O(1/\epsilon)$
iterations for an arbitrary $\epsilon > 0$. In addition, the second algorithm exhibits an
asymptotically linear rate of convergence [H] . 

This research was originally motivated by the use of the MEB framework 
and computational geometry optimization for the problem of training an SVM. 
However, a major advantage of the proposed methods over the CVM approach 
is the possibility to employ kernels which do not satisfy the conditions required 
to obtain the equivalence between the SVM and MEB optimization problems. 
For example, the popular polynomial kernel does not allow the use of CVMs as 
a training method. Since the optimal kernel for a given application cannot be 
specified a priori, the capability of a training method to work with any valid
kernel function is an important feature. Adaptations of the CVM to handle
more general kernels have been recently proposed in [38] but, in contrast, our
algorithms can be used with any Mercer kernel without changes to the theory 
or the implementation. 

The effectiveness of the proposed methods is evaluated on several
classification datasets, most of them already used to show the improvements
of CVMs over second-order SMO [37]. Our experimental results suggest that,
as long as a minor loss in accuracy is acceptable, our algorithms
significantly improve on the actual running times of CVMs. Statistical tests
are conducted to assess the significance of these conclusions. In addition,
our experiments confirm that effective classifiers are also obtained with
kernels that do not fulfill the conditions required by CVMs.

The article is organized as follows. Section 2 presents a brief overview on 
SVMs and the way by which the problem of computing an SVM can be treated 
as a MEB problem. Section 3 describes the CVM approach. In Section 4 we 
introduce the proposed methods. Section 5 presents the experimental setting 
and our numerical results. Section 6 closes the article with a discussion of the 
main conclusions of this research. 

2 Support Vector Machines and the MEB Equivalence

In this section we present an overview on Support Vector Machines (SVMs), 
and discuss the conditions under which the problem of building these models 
can be treated as a Minimal Enclosing Ball (MEB) problem in a feature space. 

2.1 The Pattern Classification Problem 

Consider a set of training data $S = \{x_i\}$ with $x_i \in \mathcal{X}$,
$i \in \mathcal{I} = \{1, \ldots, m\}$. The set $\mathcal{X}$, often coinciding
with $\mathbb{R}^n$, is called the input space, and each instance is
associated with a given category in the set $C = \{c_1, c_2, \ldots, c_K\}$.
A pattern classification problem consists of inferring from $S$ a prediction
mechanism $f : \mathcal{X} \to C$, $f \in \mathcal{F}$, termed hypothesis, to
associate new instances $x \in \mathcal{X}$ with the correct category. When
$K = 2$ the problem described above is called binary classification. This
problem can be addressed by defining a set of candidate models $\mathcal{F}$,
a risk functional $R_i(S, f)$ assessing the ability of $f$ to correctly predict
the category of the instances in $\mathcal{X}$, and a procedure $L$ by which a
dataset $S$ is mapped to a given hypothesis $f = L(S) \in \mathcal{F}$
achieving a low risk. In the context of machine learning, $L$ is called the
learning algorithm, $\mathcal{F}$ the hypothesis space and $R_i(S, f)$ the
induction principle [55].

In the rest of this paper we focus on the problem of computing a model 
designed for binary classification problems. The extension of these models to 
handle multiple categories can be accomplished in several ways. One possible
approach is to use several binary classifiers, separately trained and
joined into a multi-category decision function. Well known approaches of this 
type are one-versus-the-rest (OVR, see [31]), where one classifier is trained to 
separate each class from the rest; one-versus-one (OVO, see [19]), where different 
binary SVMs are used to separate each possible pair of classes; and DDAG, 
where one-versus-one classifiers are organized in a directed acyclic graph decision 
structure [33]. Previous experiments with SVMs show that OVO frequently
obtains a better performance both in terms of accuracy and training time [17].
Another type of extension consists in reformulating the optimization problem 
underlying the method to directly address multiple categories. See [6], [23], [28] 
and [T] for details about these methods. 

2.2 Linear Classifiers and Kernels 

Support Vector Machines implement the decision mechanism by using simple
linear functions. Since in realistic problems the configuration of the data can be
highly non-linear, SVMs build a linear model not in the original space
$\mathcal{X}$, but in a high-dimensional dot product feature space
$V = \mathrm{lin}(\phi(\mathcal{X}))$ where the original data is embedded
through the mapping $p = \phi(x)$ for each $x \in \mathcal{X}$. In this space,
it is expected that an accurate decision function can be linearly represented.
The feature space is related with $\mathcal{X}$ by means of a so called kernel
function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which allows dot
products in $V$ to be computed directly from the input space. More precisely,
for each $x_i, x_j \in \mathcal{X}$, we have $p_i^T p_j = k(x_i, x_j)$.
The explicit computation of the mapping $\phi$, which would be computationally
infeasible, is thus avoided [33]. For binary classification problems, the most
common approach is to associate a positive label $y_i = +1$ to the examples of
the first class, and a negative label $y_i = -1$ to the examples belonging to the
other class. This approach allows the use of real valued hypotheses
$h : V \to \mathbb{R}$, whose output is passed through a sign threshold to
yield the classification label
$f(x) = \mathrm{sgn}(h(p)) = \mathrm{sgn}(h(\phi(x)))$. Since $h(p)$ is a
linear function in $V$, the final prediction mechanism takes the form

$$f(x) = \mathrm{sgn}\left(h(\phi(x))\right) = \mathrm{sgn}\left(w^T \phi(x) + b\right) , \qquad (1)$$

with $w \in V$ and $b \in \mathbb{R}$. This gives a classification rule whose
decision boundary $H = \{p : w^T p + b = 0\}$ is a hyperplane with normal
vector $w$ and position parameter $b$.

2.3 Large Margin Classifiers 

It should be noted that a decision function that predicts the training data
well does not necessarily classify unseen examples well. Hence, minimizing
the training error (or empirical risk)

$$\frac{1}{2} \sum_{i \in \mathcal{I}} \left| 1 - y_i f(x_i) \right| , \qquad (2)$$

does not necessarily imply a small test error. The implementation of an
induction principle $R_i(S, f)$ guaranteeing a good classification performance
on new instances of the problem is addressed in SVMs by building on the
concept of margin $\rho$. For a given training pair $(x_i, y_i)$ the margin is
defined as
$\rho_i = \rho_f(x_i, y_i) = y_i h(x_i) = y_i \left(w^T \phi(x_i) + b\right)$
and is expected to estimate how reliable the prediction of the model on this
pattern is. Note that the example $x_i$ is misclassified if and only if
$\rho_i < 0$. Note also that a large margin of the pattern suggests a more
robust decision with respect to changes in the parameters of the decision
function $f(x)$, which are to be estimated from the training sample [33]. The
margin attained by a given prediction mechanism on the full training set $S$
is defined as the minimum margin over the whole sample, that is
$\rho = \min_{i \in \mathcal{I}} \rho_f(x_i, y_i)$. This implements a measure
of the worst classification performance on the training set, since
$\rho_i \geq \rho$ $\forall i$ [35]. Under some regularity conditions, a large
margin leads to theoretical guarantees of good performance on new decision
instances [39]. The decision function maximizing the margin on the training
data is thus obtained by solving

$$\underset{w, b}{\text{maximize}} \ \rho = \underset{w, b}{\text{maximize}} \ \min_{i \in \mathcal{I}} \rho_f(x_i, y_i) , \qquad (3)$$

or, equivalently,

$$\begin{aligned} \underset{w, b, \rho}{\text{maximize}} \quad & \rho \\ \text{subject to} \quad & \rho_i \geq \rho, \quad i \in \mathcal{I} . \end{aligned} \qquad (4)$$

However, without some constraint on the size of $w$, the solution to this maximin
problem does not exist [35, 14]. On the other hand, even if we fix the norm of
$w$, a separating hyperplane guaranteeing a positive margin $\rho_f(x_i, y_i)$
on each training pattern need not exist. This is the case, for example, if a
high noise level causes a large overlap of the classes. In this case, the
hyperplane maximizing (3) performs poorly because the prediction mechanism is
determined entirely by misclassified examples and the theoretical results
guaranteeing a good classification accuracy on unseen patterns no longer hold
[35]. A standard approach to deal with noisy training patterns is to allow for
the possibility of examples violating the constraint $\rho_i \geq \rho$
$\forall i$ and to compute the margin on a subset of training examples. The
exact way by which SVMs address these problems gives
rise to specific formulations, called soft-margin SVMs.
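The margin definitions above can be illustrated with a minimal sketch. It assumes, purely for illustration, a linear model acting directly on the input space (so that $\phi$ is the identity); the function name `margins` and the toy data are hypothetical and not part of the original text.

```python
import numpy as np

def margins(w, b, X, y):
    """Per-example margins rho_i = y_i * (w^T x_i + b) for a linear model.

    Assumption: phi is the identity map, so the feature space is the input space.
    """
    return y * (X @ w + b)

# Toy data (hypothetical): two positive examples and one negative example.
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

rho_i = margins(w, b, X, y)   # one margin per training pattern
rho = rho_i.min()             # margin of the classifier on the full sample
misclassified = rho_i < 0     # an example is misclassified iff its margin is negative
```

Here the smallest per-example margin plays the role of $\rho$, so maximizing it amounts to improving the worst-case training pattern.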



2.4 Soft-Margin SVM Formulations 

In L1-SVMs (see e.g. [51 |33J [H]), degeneracy of problem (4) is addressed by
scaling the constraints $\rho_i \geq \rho$ as $\rho_i / \|w\| \geq \rho$ and by
adding the constraint $\|w\| = 1/\rho$, so that the problem now takes the form
of the quadratic programming problem

$$\begin{aligned} \underset{w, b}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 \\ \text{subject to} \quad & \rho_f(x_i, y_i) \geq 1, \quad i \in \mathcal{I} . \end{aligned} \qquad (5)$$

Noisy training examples are handled by incorporating slack variables
$\xi_i \geq 0$ to the constraints in (5) and by penalizing them in the
objective function:

$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i \in \mathcal{I}} \xi_i \\ \text{subject to} \quad & \rho_f(x_i, y_i) \geq 1 - \xi_i, \quad i \in \mathcal{I} . \end{aligned} \qquad (6)$$

This leads to the so called soft-margin L1-SVM. In this formulation, the
parameter $C$ controls the trade-off between margin maximization and
violation of the margin constraints.

Several other reformulations of problem (6) can be found in literature. In
particular, in some formulations the two-norm of $\xi$ is penalized instead of
the one-norm. In this article, we are particularly interested in the soft
margin L2-SVM proposed by Lee and Mangasarian. In this formulation, the
margin constraints $\rho_i \geq \rho$ in (4) are preserved, the margin variable
$\rho$ is explicitly incorporated in the objective function and degeneracy is
addressed by penalizing the squared norms of both $w$ and $b$,

$$\begin{aligned} \underset{w, b, \rho, \xi}{\text{minimize}} \quad & \frac{1}{2} \Big( \|w\|^2 + b^2 + C \sum_{i \in \mathcal{I}} \xi_i^2 \Big) - \rho \\ \text{subject to} \quad & \rho_f(x_i, y_i) \geq \rho - \xi_i, \quad i \in \mathcal{I} . \end{aligned} \qquad (7)$$

In practice, L2-SVMs and L1-SVMs usually obtain a similar classification
accuracy in predicting unseen patterns [24, 37].

2.5 The Target QP 

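In this paper we focus on the L2-SVM model as described above. The use of this
formulation is mainly motivated by efficiency: by adopting the slightly modified
functional of Eqn. (7) we can exploit the framework introduced in [37] and solve
the learning problem more easily, as we will explain in the next subsection.
As a drawback, the constraints of problem (7) explicitly depend on the images
$p_i = \phi(x_i)$ of the training examples under the mapping $\phi$. In
practice, to avoid the explicit computation of the mapping, it is convenient to
derive the Wolfe dual of the problem by incorporating multipliers
$\alpha_i \geq 0$, $i \in \mathcal{I}$, and considering its Lagrangian

$$L(w, b, \xi, \rho, \alpha) = \frac{1}{2} \Big( \|w\|^2 + b^2 + C \sum_{i \in \mathcal{I}} \xi_i^2 \Big) - \rho - \sum_{i \in \mathcal{I}} \alpha_i \left( \rho_f(x_i, y_i) - \rho + \xi_i \right) . \qquad (8)$$

From the Karush-Kuhn-Tucker conditions for the optimality of (7) with respect
to the primal variables we have (see [51 [551 H^ ):

$$\begin{aligned}
\frac{\partial L}{\partial w} = 0 \ &\Rightarrow \ w = \sum_{i \in \mathcal{I}} \alpha_i y_i p_i , \\
\frac{\partial L}{\partial b} = 0 \ &\Rightarrow \ b = \sum_{i \in \mathcal{I}} \alpha_i y_i , \\
\frac{\partial L}{\partial \xi_i} = 0 \ &\Rightarrow \ \xi_i = \frac{\alpha_i}{C} , \quad i \in \mathcal{I} , \\
\frac{\partial L}{\partial \rho} = 0 \ &\Rightarrow \ \sum_{i \in \mathcal{I}} \alpha_i = 1 .
\end{aligned} \qquad (9)$$

Plugging into the Lagrangian, we have

$$L = -\frac{1}{2} \sum_{i,j \in \mathcal{I}} \alpha_i \alpha_j y_i y_j p_i^T p_j - \frac{1}{2} \sum_{i,j \in \mathcal{I}} \alpha_i \alpha_j y_i y_j - \frac{1}{2C} \sum_{i \in \mathcal{I}} \alpha_i^2 . \qquad (10)$$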
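By definition of Wolfe dual (see [33]), it immediately follows that (7) is
equivalent to the following QP:

$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & - \sum_{i,j \in \mathcal{I}} \alpha_i \alpha_j \left( y_i y_j p_i^T p_j + y_i y_j + \frac{\delta_{ij}}{C} \right) \\ \text{subject to} \quad & \sum_{i \in \mathcal{I}} \alpha_i = 1, \quad \alpha_i \geq 0, \quad i \in \mathcal{I} , \end{aligned} \qquad (11)$$

where $\delta_{ij}$ is equal to 1 if $i = j$, and 0 otherwise. In contrast to
(7), the problem above depends on the training examples images
$p_i = \phi(x_i)$ only through the dot products $p_i^T p_j$. By using the
kernel function we can hence obtain a problem defined entirely on the
original data

$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & - \sum_{i,j \in \mathcal{I}} \alpha_i \alpha_j \left( y_i y_j \left( k(x_i, x_j) + 1 \right) + \frac{\delta_{ij}}{C} \right) \\ \text{subject to} \quad & \sum_{i \in \mathcal{I}} \alpha_i = 1, \quad \alpha_i \geq 0, \quad i \in \mathcal{I} . \end{aligned} \qquad (12)$$

From equations (9) we can also write the decision function (1) in terms of the
original training examples as $f(x) = \mathrm{sgn}(h(x))$, where

$$h(x) = w^T \phi(x) + b = \sum_{i \in \mathcal{I}} \alpha_i y_i \left( k(x_i, x) + 1 \right) . \qquad (13)$$

Note that the decision function above depends only on the subset of training
examples for which $\alpha_i \neq 0$. These examples are usually called the
support vectors of the model [33]. The set of support vectors is often
considerably smaller than the original training set.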
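As an illustration of (13), the following minimal sketch evaluates the decision function from a given vector of dual weights, summing only over the support vectors. The Gaussian kernel, its parameter `gamma`, and the function names are assumptions made for the example; the text does not prescribe them.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) (illustrative choice)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def decision_value(x, X, y, alpha, kernel=rbf_kernel):
    """h(x) = sum_i alpha_i * y_i * (k(x_i, x) + 1), restricted to alpha_i > 0."""
    support = np.flatnonzero(np.asarray(alpha) > 0)  # indices of the support vectors
    return sum(alpha[i] * y[i] * (kernel(X[i], x) + 1.0) for i in support)

def predict(x, X, y, alpha, kernel=rbf_kernel):
    """Binary label sgn(h(x)); h(x) = 0 is mapped to +1 by convention here."""
    return 1 if decision_value(x, X, y, alpha, kernel) >= 0 else -1
```

Since examples with zero weight are skipped, the cost of a prediction grows with the number of support vectors rather than with $m$.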



2.6 Computing SVMs as Minimal Enclosing Balls (MEBs) 

Now we explain why the L2-SVM formulation introduced in the previous para- 
graphs can lead to efficient algorithms to extract SVM classifiers from data. 
As pointed out first in [37] and then generalized in [38], the L2-SVM can be
equivalently formulated as a MEB problem in a certain feature space, that is, 
as the computation of the ball of smallest radius containing the image of the 
dataset under a mapping into a dot product space Z. 

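Consider the image of the training set $S$ under a mapping $\varphi$, that is,
$\varphi(S) = \{z_i = \varphi(x_i) : i \in \mathcal{I}\}$. Suppose now that
there exists a kernel function $\tilde{k}$ such that
$\tilde{k}(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ $\forall i, j \in \mathcal{I}$.
Denote the closed ball of center $c \in Z$ and radius $r \in \mathbb{R}_+$ as
$B(c, r)$. The MEB $B(c^*, r^*)$ of $\varphi(S)$ can be defined as the
solution of the following optimization problem

$$\begin{aligned} \underset{r^2, c}{\text{minimize}} \quad & r^2 \\ \text{subject to} \quad & \|z_i - c\|^2 \leq r^2, \quad i \in \mathcal{I} . \end{aligned} \qquad (14)$$

By using the kernel function $\tilde{k}$ to implement dot products in $Z$, the
following Wolfe dual of the MEB problem is obtained (see [41]):

$$\begin{aligned} \underset{\alpha}{\text{maximize}} \quad & \Phi(\alpha) := \sum_{i \in \mathcal{I}} \alpha_i \tilde{k}(x_i, x_i) - \sum_{i,j \in \mathcal{I}} \alpha_i \alpha_j \tilde{k}(x_i, x_j) \\ \text{subject to} \quad & \sum_{i \in \mathcal{I}} \alpha_i = 1, \quad \alpha_i \geq 0, \quad i \in \mathcal{I} . \end{aligned} \qquad (15)$$

If we denote by $\alpha^*$ the solution of (15), formulas for the center $c^*$
and the squared radius $r^{*2}$ of $MEB(\varphi(S))$ follow from strong duality:

$$c^* = \sum_{i \in \mathcal{I}} \alpha_i^* z_i , \qquad r^{*2} = \Phi(\alpha^*) = \sum_{i \in \mathcal{I}} \alpha_i^* \tilde{k}(x_i, x_i) - \sum_{i,j \in \mathcal{I}} \alpha_i^* \alpha_j^* \tilde{k}(x_i, x_j) . \qquad (16)$$

Note that the MEB depends only on the subset of points $\mathcal{C}$ for which
$\alpha_i^* \neq 0$. It can be shown that computing the MEB on
$\mathcal{C} \subset \varphi(S)$ is equivalent to computing the MEB on the
entire dataset $\varphi(S)$. This set is frequently called a coreset of
$\varphi(S)$, a concept we are going to explore further in the next sections.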

We immediately notice a deep similarity between problems (12) and (15),
the only difference being the presence of a linear term in the objective function
of the latter. This linear term can be neglected under mild conditions on the
kernel function $\tilde{k}$. Suppose $\tilde{k}$ fulfills the following
normalization condition:

$$\tilde{k}(x_i, x_i) = \Delta^2 = \text{constant} . \qquad (17)$$

Since $\sum_{i \in \mathcal{I}} \alpha_i = 1$, the linear term
$\sum_{i \in \mathcal{I}} \alpha_i \tilde{k}(x_i, x_i)$ in (15) becomes a
constant and can be ignored when optimizing for $\alpha$. Equivalence between
the solutions of problems (12) and (15) follows if we set

$$\tilde{k}(x_i, x_j) = y_i y_j \left( k(x_i, x_j) + 1 \right) + \frac{\delta_{ij}}{C} , \qquad (18)$$

where $k$ is the kernel function used within the SVM classifier. Therefore,
computing an SVM for a set of labelled data $S = \{x_i : i \in \mathcal{I}\}$
is equivalent to computing the MEB of the set of feature points
$\varphi(S) = \{z_i = \varphi(x_i) : i \in \mathcal{I}\}$, where the mapping
$\varphi$ satisfies the condition
$\tilde{k}(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. A possible implementation
of such a mapping is
$\varphi(x_i) = \left( y_i \phi(x_i), \, y_i, \, \tfrac{1}{\sqrt{C}} e_i \right)$,
where $e_i$ is the $i$-th vector of the canonical basis and $\phi(x_i)$ is in
turn the mapping associated with the original Mercer kernel $k$ used by the SVM.

Note that the previous equivalence between the MEB and the SVM problems
holds if and only if the kernel $\tilde{k}$ fulfills assumption (17). If, for
example, the SVM classifier implements the well-known $d$-th order polynomial
kernel $k(x_i, x_j) = (x_i^T x_j + 1)^d$, we have that $\tilde{k}(x_i, x_i)$ is
no longer a constant, and thus the MEB equivalence no longer holds. Complex
constructions are required to extend the
MEB optimization framework to SVMs using different kernel functions [55].
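The role of the normalization condition (17) can be checked numerically. The sketch below builds the modified kernel matrix of (18) from a Gram matrix of the SVM kernel and tests whether its diagonal is constant. The helper names, the toy data and the choice $C = 10$ are hypothetical and serve only the illustration.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def poly_gram(X, degree=2):
    """Gram matrix of the polynomial kernel (x_i^T x_j + 1)^degree."""
    return (X @ X.T + 1.0) ** degree

def meb_kernel_matrix(K, y, C):
    """Modified kernel of (18): y_i y_j (K_ij + 1) + delta_ij / C."""
    y = np.asarray(y, dtype=float)
    return np.outer(y, y) * (K + 1.0) + np.eye(len(y)) / C

def has_constant_diagonal(K_tilde, tol=1e-12):
    """Check the normalization condition (17) on a finite sample."""
    d = np.diag(K_tilde)
    return bool(np.all(np.abs(d - d[0]) <= tol))

X = np.array([[0.0], [1.0], [3.0]])
y = np.array([1.0, -1.0, 1.0])
ok_rbf = has_constant_diagonal(meb_kernel_matrix(rbf_gram(X), y, C=10.0))
ok_poly = has_constant_diagonal(meb_kernel_matrix(poly_gram(X), y, C=10.0))
```

With the Gaussian kernel every diagonal entry equals $2 + 1/C$, whereas with the polynomial kernel the diagonal depends on $\|x_i\|$, which is exactly why the equivalence breaks down in that case.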



3 Badoiu-Clarkson Algorithm and Core Vector Machines



Problem (15) is in general a large and dense QP. Obtaining a numerical solution
when $m$ is large is very expensive, no matter which kind of numerical method one
decides to employ. Taking into account that in practice we can only approximate
the solution within a given tolerance, it is convenient to modify a priori our
objective: instead of $MEB(\varphi(S))$, we can try to compute an approximate
MEB in the sense specified by the following definition.

Definition 1. Let $MEB(\varphi(S)) = B(c^*, r^*)$ and $\epsilon > 0$ be a given
tolerance. Then, a $(1+\epsilon)$-MEB of $\varphi(S)$ is a ball $B(c, r)$ such that

$$r \leq r^* \quad \text{and} \quad \varphi(S) \subset B(c, (1+\epsilon) r) . \qquad (19)$$

A set $C_S \subset \varphi(S)$ is an $\epsilon$-coreset of $\varphi(S)$ if
$MEB(C_S)$ is a $(1+\epsilon)$-MEB of $\varphi(S)$.

In P] and [H], algorithms to compute $(1+\epsilon)$-MEBs that scale
independently of the dimension of $Z$ and the cardinality of $S$ have been
provided. In particular, the Badoiu-Clarkson (BC) algorithm described in [5] is
able to provide an $\epsilon$-coreset $C_S$ of $\varphi(S)$ in no more than
$O(1/\epsilon)$ iterations. We denote with $C_k$ the coreset approximation
obtained at the $k$-th iteration and its MEB as $B_k = B(c_k, r_k)$. Starting
from a given $C_0$, at each iteration $C_{k+1}$ is defined as the union of $C_k$
and the point of $\varphi(S)$ furthest from $c_k$. The algorithm then computes
$B_{k+1}$ and stops if $B(c_{k+1}, (1+\epsilon) r_{k+1})$ contains $\varphi(S)$.

Exploiting these ideas, Tsang and colleagues introduced in [37] the CVM
(Core Vector Machine) for training SVMs supporting a reduction to a MEB
problem. The CVM is described in Algorithm 1, where each $C_k$ is identified by
the index set $\mathcal{I}_k \subset \mathcal{I}$. The elements included in
$C_k$ are called the core vectors. Their role is exactly analogous to that of
support vectors in a classical SVM model.



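The expression for the radius follows easily from (16). Moreover, it is easy
to show (see [37]) that step 13 exactly looks for the point $x_{i^*}$ whose
image $\varphi(x_{i^*})$ is the furthest from $c_k$. In fact, by using the
expressions $c_k = \sum_{j \in \mathcal{I}_k} \alpha_{k,j} z_j$ and
$\tilde{k}(x_i, x_i) = \Delta^2$ $\forall i \in \mathcal{I}$, we obtain:

$$\begin{aligned} \|z_i - c_k\|^2 &= \tilde{k}(x_i, x_i) + \sum_{j,l \in \mathcal{I}_k} \alpha_{k,j} \alpha_{k,l} \tilde{k}(x_j, x_l) - 2 \sum_{j \in \mathcal{I}_k} \alpha_{k,j} \tilde{k}(x_j, x_i) \\ &= \Delta^2 + R_k - 2 \sum_{j \in \mathcal{I}_k} \alpha_{k,j} \tilde{k}(x_j, x_i) , \end{aligned} \qquad (20)$$

where $R_k = \sum_{j,l \in \mathcal{I}_k} \alpha_{k,j} \alpha_{k,l} \tilde{k}(x_j, x_l)$
as in Algorithm 1. Note how this computation can be performed, by means of
kernel evaluations, in spite of the lack of an explicit representation of
$c_k$ and $z_i$. Once $i^*$ has been found, it is included in the index set.
Finally, the reduced QP
corresponding to the MEB of the new approximate coreset is solved.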
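The kernel-only distance computation above can be sketched as follows. The code assumes, for simplicity, that the full kernel matrix is available in memory (in a large-scale setting one would evaluate only the needed columns); the function name is hypothetical.

```python
import numpy as np

def squared_distances_to_center(K, alpha):
    """Squared distances ||z_i - c||^2 for c = sum_j alpha_j z_j, from kernel values only.

    Uses ||z_i - c||^2 = K_ii + alpha^T K alpha - 2 (K alpha)_i.
    K is the (assumed precomputed) kernel matrix of the MEB problem.
    """
    K = np.asarray(K, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    Ka = K @ alpha
    return np.diag(K) + alpha @ Ka - 2.0 * Ka

# Sanity check with explicit points, so that K = Z Z^T is a linear kernel.
Z = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
K = Z @ Z.T
alpha = np.array([0.5, 0.5, 0.0])            # center c = (1, 0)
d2 = squared_distances_to_center(K, alpha)
i_star = int(np.argmax(d2))                  # index of the furthest point
```

When the kernel satisfies the normalization condition (17), the first term is the constant $\Delta^2$ and only the last term depends on $i$.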

Algorithm 1 has two main sources of computational overhead: the computation
of the furthest point from $c_k$, which is linear in $m$, and the solution of
the optimization subproblem in step 10. The complexity of the former step can be
made constant and independent of $m$ by suitable sampling techniques (see [37]),
an issue to which we will return later. As regards the optimization step, CVMs
adopt a SMO method, where only two variables are selected for optimization
at each iteration [51 [30]. It is known that the cost of each SMO iteration is not
too high, but the method can require a large number of iterations in order to
satisfy reasonable stopping criteria [30].

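Algorithm 1 BC Algorithm for MEB-SVMs: the Core Vector Machine
Input: $S$, $\epsilon$.

 1: initialization: compute $\mathcal{I}_0$ and $\alpha_0$;
 2: $\Delta^2 \leftarrow \tilde{k}(x_1, x_1)$;
 3: $R_0 \leftarrow \sum_{i,j \in \mathcal{I}_0} \alpha_{0,i} \alpha_{0,j} \tilde{k}(x_i, x_j)$;
 4: $r_0^2 \leftarrow \Delta^2 - R_0$;
 5: $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_0; i) := \Delta^2 + R_0 - 2 \sum_{j \in \mathcal{I}_0} \alpha_{0,j} \tilde{k}(x_j, x_i)$;
 6: $k \leftarrow 0$;
 7: while $\gamma^2(\alpha_k; i^*) > (1+\epsilon)^2 r_k^2$ do
 8:     $k \leftarrow k + 1$;
 9:     $\mathcal{I}_k \leftarrow \mathcal{I}_{k-1} \cup \{i^*\}$;
10:     Find $\alpha_k$ by solving the reduced QP problem

        $$\begin{aligned} \underset{\alpha}{\text{minimize}} \quad & R(\alpha) := \sum_{i,j \in \mathcal{I}_k} \alpha_i \alpha_j \tilde{k}(x_i, x_j) \\ \text{subject to} \quad & \sum_{i \in \mathcal{I}_k} \alpha_i = 1, \quad \alpha_i \geq 0, \quad i \in \mathcal{I}_k ; \end{aligned} \qquad (21)$$

11:     $R_k \leftarrow R(\alpha_k)$;
12:     $r_k^2 \leftarrow \Delta^2 - R_k$;
13:     $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_k; i) := \Delta^2 + R_k - 2 \sum_{j \in \mathcal{I}_k} \alpha_{k,j} \tilde{k}(x_j, x_i)$;
14: end while
15: return $\mathcal{I}_S = \mathcal{I}_k$, $\alpha = \alpha_k$.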



As regards the initialization, that is, the computation of $C_0$ and
$\alpha_0$, a simple choice is suggested in [5T], which consists in choosing
$C_0 = \{z_a, z_b\}$, where $z_a$ is an arbitrary point in $\varphi(S)$ and
$z_b$ is the farthest point from $z_a$. Obviously, in this case the center and
radius of $B_0$ are $c_0 = 0.5 (z_a + z_b)$ and $r_0 = 0.5 \|z_a - z_b\|$,
respectively. That is, we initialize $\mathcal{I}_0 = \{a, b\}$,
$\alpha_{0,a} = \alpha_{0,b} = 0.5$ and $\alpha_{0,i} = 0$ for
$i \notin \mathcal{I}_0$. A more efficient strategy, implemented for example in
the code LIBCVM [36], is the following. The procedure consists in determining
the MEB of a subset $P = \{z_i, i \in \mathcal{I}_P\}$ of $p$ training points,
where the set of indices $\mathcal{I}_P$ is randomly chosen and $p$ is small.
This MEB is approximated by running a SMO solver. In practice, $p \approx 20$
is suggested to be enough, but one can also try larger initial guesses, as long
as SMO can rapidly compute the initial MEB. $C_0$ is then defined as the set of
points $z_i \in P$ gaining a strictly positive dual weight in the
process, and $\mathcal{I}_0$ as the set of the corresponding indices.



4 Frank-Wolfe Methods for the MEB-SVM Problem

4.1 Overview of the Frank-Wolfe Algorithm

The Frank-Wolfe algorithm (FW), originally presented in [11], is designed to
solve optimization problems of the form

$$\underset{\alpha \in \Sigma}{\text{maximize}} \ f(\alpha) , \qquad (22)$$

where $f \in C^1(\mathbb{R}^m)$ is a concave function, and
$\Sigma \neq \emptyset$ a bounded convex polyhedron.

In the case of the MEB dual problem, the objective function is quadratic
and $\Sigma$ coincides with the unit simplex. Given the current iterate
$\alpha_k \in \Sigma$, a standard Frank-Wolfe iteration consists in the
following steps:

1. Find a point $u_k \in \Sigma$ maximizing the local linear approximation
   $\psi_k(u) := f(\alpha_k) + (u - \alpha_k)^T \nabla f(\alpha_k)$, and define
   $d_k^{FW} = u_k - \alpha_k$.

2. Perform a line-search
   $\lambda_k = \arg\max_{\lambda \in [0,1]} f(\alpha_k + \lambda d_k^{FW})$.

3. Update the iterate by

   $$\alpha_{k+1} = \alpha_k + \lambda_k d_k^{FW} = (1 - \lambda_k) \alpha_k + \lambda_k u_k . \qquad (23)$$

The algorithm is usually stopped when the objective function is sufficiently close
to its optimal value, according to a suitable proximity measure [13].

Since $\psi_k(u)$ is a linear function and $\Sigma$ is a bounded polyhedron,
the search directions $d_k^{FW}$ are always directed towards an extreme point
of $\Sigma$. That is, $u_k$ is a vertex of the feasible set. The constraint
$\lambda_k \in [0, 1]$ ensures feasibility at each iteration. It is easy to
show that in the case of the MEB problem $u_k = e_{i^*}$, where $e_i$ denotes
the $i$-th vector of the canonical basis, and $i^*$ is the index corresponding
to the largest component of $\nabla f(\alpha_k)$ [41]. The updating step
therefore assumes the form

$$\alpha_{k+1} = (1 - \lambda_k) \alpha_k + \lambda_k e_{i^*} . \qquad (24)$$

It can be proved that the above procedure converges globally [13]. As a
drawback, however, it often exhibits a tendency to stagnate near a solution.
Intuitively, suppose that solutions $\alpha^*$ of (22) lie on the boundary of
$\Sigma$ (this is often true in practice, and holds in particular for the MEB
problem). In this case, as $\alpha_k$ gets close to a solution $\alpha^*$, the
directions $d_k^{FW}$ become more and more orthogonal to
$\nabla f(\alpha_k)$. As a consequence, $\alpha_k$ possibly never reaches the
face of $\Sigma$ containing $\alpha^*$, resulting in a sublinear convergence
rate [13].
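The three steps above can be sketched for the MEB dual (15) as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the kernel matrix is assumed to fit in memory, the iterate starts at the first vertex of the simplex, and the stopping rule is the linearization gap $\nabla f(\alpha_k)^T d_k^{FW} \leq$ `eps`, chosen here for simplicity. The step size uses the closed form of the exact line search for a concave quadratic, $\lambda = \nabla f^T d / (2\, d^T K d)$ clipped to $[0, 1]$.

```python
import numpy as np

def frank_wolfe_meb(K, eps=1e-6, max_iter=10000):
    """Frank-Wolfe sketch for: maximize sum_i a_i K_ii - a^T K a over the unit simplex."""
    K = np.asarray(K, dtype=float)
    diag = np.diag(K).copy()
    alpha = np.zeros(K.shape[0])
    alpha[0] = 1.0                              # start at a vertex (assumed choice)
    for _ in range(max_iter):
        Ka = K @ alpha
        grad = diag - 2.0 * Ka                  # gradient of the dual objective
        i = int(np.argmax(grad))                # vertex e_i maximizing the linearization
        gap = grad[i] - grad @ alpha            # slope along d = e_i - alpha
        if gap <= eps:                          # assumed stopping rule
            break
        curv = K[i, i] - 2.0 * Ka[i] + alpha @ Ka   # d^T K d = ||z_i - c||^2
        lam = 1.0 if curv <= 0 else min(1.0, gap / (2.0 * curv))
        alpha *= (1.0 - lam)                    # update (24)
        alpha[i] += lam
    return alpha

Z = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.5]])
alpha = frank_wolfe_meb(Z @ Z.T)
center = alpha @ Z
```

Because the objective is concave, the gap used in the stopping rule is an upper bound on the distance between the current and the optimal objective values, so `eps` directly controls the accuracy of the computed squared radius.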



4.2 The Modified Frank-Wolfe Algorithm

We now describe an improvement over the general Frank-Wolfe procedure, which
was first proposed in [40] and later detailed in [13]. This improvement can be
quantified in terms of the rate of convergence of the algorithm and thus of the 
number of iterations in which it can be expected to fulfill the stopping conditions. 

In practice, the tendency of FW to stagnate near a solution can lead to 
later iterations wasting computational resources while making minimal progress 
towards the optimal function value. It would thus be desirable to obtain a 
stronger result on the convergence rate, which guarantees that the speed of the 
algorithm does not deteriorate when approaching a solution. This paragraph 
describes a technique geared precisely towards this aim. 

Essentially, the previous algorithm is enhanced by introducing alternative
search directions known as away steps. The basic idea is that, instead of moving
towards the vertex of $\Sigma$ maximizing a linear approximation $\psi_k$ of
$f$ in $\alpha_k$, we can move away from the vertex minimizing $\psi_k$. At
each iteration, a choice between these two options is made by choosing the
best ascent direction. The whole procedure, known as Modified Frank-Wolfe
algorithm (MFW), can be sketched as follows:

1. Find $u_k \in \Sigma$ and define $d_k^{FW}$ as in the standard FW algorithm.

2. Find $v_k \in \Sigma$ by minimizing $\psi_k(v)$, s.t. $v_j = 0$ if
   $\alpha_{k,j} = 0$. Define $d_k^{A} = \alpha_k - v_k$.

3. If $\nabla f(\alpha_k)^T d_k^{FW} \geq \nabla f(\alpha_k)^T d_k^{A}$, then
   $d_k = d_k^{FW}$, else $d_k = d_k^{A}$.

4. Perform a line-search
   $\lambda_k = \arg\max_{\lambda \in [0, \bar{\lambda}]} f(\alpha_k + \lambda d_k)$,
   where $\bar{\lambda} = 1$ if $d_k = d_k^{FW}$ and
   $\bar{\lambda} = \max_{\lambda \geq 0} \{\lambda \mid \alpha_k + \lambda d_k^{A} \in \Sigma\}$
   otherwise.

5. Update the iterate by

   $$\alpha_{k+1} = \alpha_k + \lambda_k d_k = \begin{cases} (1 - \lambda_k) \alpha_k + \lambda_k u_k & \text{if } d_k = d_k^{FW} , \\ (1 + \lambda_k) \alpha_k - \lambda_k v_k & \text{if } d_k = d_k^{A} . \end{cases} \qquad (25)$$

It is easy to show that both $d_k^{FW}$ and $d_k^{A}$ are feasible ascent
directions, unless $\alpha_k$ is already a stationary point.

In the case of the MEB problem, step 2 corresponds to finding the basis
vector $e_{j^*}$ corresponding to the smallest component of $\nabla f(\alpha_k)$ [41]. Note that a
face of $\Sigma$ of lower dimensionality is reached whenever an away step with maximal
stepsize is performed. Imposing the constraint in step 2 is tantamount to
ruling out away steps with zero stepsize. That is, an away step from $e_j$ cannot
be taken if $\alpha_{k,j}$ is already zero.

In [13], linear convergence of $f(\alpha_k)$ to $f(\alpha^*)$ was proved, assuming Lipschitz
continuity of $\nabla f$, strong concavity of $f$, and strict complementarity at the
solution. In [IT], a proof of the same result was provided for the MEB problem,
under weaker assumptions. It is important to note that such assumptions are in
particular satisfied for the MEB formulation of the L2-SVM, and that as such
the aforementioned linear convergence property holds for all problems considered
in this paper. In particular, uniqueness of the solution, which is implied if
we ask for strong (or just strict) concavity, is not required. The gist is essentially
that, in a small neighborhood of a solution $\alpha^*$, MFW is forced to perform away
steps until the face of $\Sigma$ containing $\alpha^*$ is reached, which happens after a finite
number of iterations. Starting from that point, the algorithm behaves as an
unconstrained optimization method, and it can be proved that $f(\alpha_k)$ converges
to $f(\alpha^*)$ linearly [13].



4.3 The FW and MFW Algorithms for MEB-SVMs 

If the FW method is applied to the MEB dual problem, the structure of the objective
function $\Phi(\alpha)$ can be exploited in order to obtain explicit formulas for steps 1
and 2 of the generic procedure. Indeed, the components of $\nabla\Phi(\alpha_k)$ are given
by

$$\nabla\Phi(\alpha_k)_i = \|z_i\|^2 - 2 \sum_{j \in \mathcal{I}} \alpha_{k,j} z_i^T z_j = \|z_i\|^2 - 2 z_i^T c_k , \qquad (26)$$

where

$$c_k = \sum_{j \in \mathcal{I}} \alpha_{k,j} z_j , \qquad (27)$$

and therefore, since $\|c_k\|$ does not depend on $i$,

$$i^* = \arg\max_{i \in \mathcal{I}} \nabla\Phi(\alpha_k)_i = \arg\max_{i \in \mathcal{I}} \|z_i - c_k\|^2 . \qquad (28)$$

In practice, step 1 selects the index of the input point maximizing the distance
from $c_k$, exactly as done in the CVM procedure. Computation of distances can
be carried out as in CVMs, using (20). As regards step 2, it can be shown (see
HIIT]) that

$$\lambda_k = \frac{1}{2} \left( 1 - \frac{r_k^2}{\|z_{i^*} - c_k\|^2} \right) , \qquad (29)$$

where

$$r_k^2 = \Phi(\alpha_k) . \qquad (30)$$

By comparing (27) and (30) with (16), we argue that, as in the BC algorithm,
a ball $B_k = B(c_k, r_k)$ is identified at each iteration.

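The whole procedure is sketched in Algorithm 2, where at each iteration we
associate to $\alpha_k$ the index set $\mathcal{I}_k = \{i \in \mathcal{I} : \alpha_{k,i} > 0\}$.

Algorithm 2 Frank-Wolfe Algorithm for MEB-SVMs
Input: $\mathcal{S}$, $\epsilon$.

 1: initialization: compute $\mathcal{I}_0$ and $\alpha_0$;
 2: $\Delta^2 \leftarrow k(x_1, x_1)$;
 3: $R_0 \leftarrow \sum_{i,j \in \mathcal{I}_0} \alpha_{0,i} \alpha_{0,j} k(x_i, x_j)$;
 4: $r_0^2 \leftarrow \Delta^2 - R_0$;
 5: $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_0; i) := \Delta^2 + R_0 - 2 \sum_{j \in \mathcal{I}_0} \alpha_{0,j} k(x_j, x_i)$;
 6: $\delta_0 \leftarrow \gamma^2(\alpha_0; i^*) / r_0^2 - 1$;
 7: $k \leftarrow 0$;
 8: while $\delta_k > (1 + \epsilon)^2 - 1$ do
 9:   $\lambda_k \leftarrow \frac{1}{2} \left( 1 - r_k^2 / \gamma^2(\alpha_k; i^*) \right)$;
10:   $k \leftarrow k + 1$;
11:   $\alpha_k \leftarrow (1 - \lambda_{k-1}) \alpha_{k-1} + \lambda_{k-1} e_{i^*}$;
12:   $\mathcal{I}_k \leftarrow \{i \in \mathcal{I} : \alpha_{k,i} > 0\}$;
13:   $R_k \leftarrow \sum_{i,j \in \mathcal{I}_k} \alpha_{k,i} \alpha_{k,j} k(x_i, x_j)$;
14:   $r_k^2 \leftarrow r_{k-1}^2 \left( 1 + \delta_{k-1}^2 / (4 (1 + \delta_{k-1})) \right)$;
15:   $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_k; i) := \Delta^2 + R_k - 2 \sum_{j \in \mathcal{I}_k} \alpha_{k,j} k(x_j, x_i)$;
16:   $\delta_k \leftarrow \gamma^2(\alpha_k; i^*) / r_k^2 - 1$;
17: end while
18: return $\mathcal{I}_S = \mathcal{I}_k$, $\alpha = \alpha_k$.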



As regards the initialization, $\alpha_0$ and $\mathcal{I}_0$ can be defined exactly as in the CVM
procedure. At subsequent iterations, the formula to update $\mathcal{I}_k$ immediately
follows from the updating rule (24) for $\alpha_k$; indeed, the indices of the strictly positive
components of $\alpha_{k+1}$ are the same as those of $\alpha_k$, plus $i^*$ if $\alpha_{k,i^*} = 0$ (which means that
$z_{i^*}$ was not already included in the current coreset). The introduction of the
sequence $\{\mathcal{I}_k\}$ in Algorithm 2 makes it evident that structure and output of
Algorithm 1 are preserved.

The updating formula used in step 14 appears in [41]. It is easy to see that
it is equivalent to (30) and computationally more convenient.
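
To make the role of the recursive radius update concrete, the following is a minimal sketch of the FW iteration for the MEB problem, written under simplifying assumptions that are not part of the original procedure: the kernel matrix `K` is precomputed and has constant diagonal $\Delta^2$, the first two points are distinct, and the initialization simply puts weight 1/2 on each of them instead of using the CVM initialization. The function name `fw_meb` is hypothetical.

```python
def fw_meb(K, eps=1e-3, max_iter=10000):
    """Sketch of the FW iteration for the MEB dual with a normalized kernel.

    K: precomputed kernel matrix (list of lists) with constant diagonal.
    Returns (alpha, r2), where r2 is updated recursively.
    """
    m = len(K)
    delta2 = K[0][0]                       # Delta^2 = k(x_1, x_1)
    alpha = [0.0] * m
    alpha[0] = alpha[1] = 0.5              # assumed trivial initialization
    Ka = [0.5 * (K[i][0] + K[i][1]) for i in range(m)]   # K @ alpha
    r2 = delta2 - sum(alpha[i] * Ka[i] for i in range(m))
    for _ in range(max_iter):
        R = sum(alpha[i] * Ka[i] for i in range(m))
        # gamma^2(alpha; i) = Delta^2 + R - 2 (K alpha)_i is largest
        # where (K alpha)_i is smallest.
        i_star = min(range(m), key=lambda i: Ka[i])
        gamma2 = delta2 + R - 2.0 * Ka[i_star]
        delta = gamma2 / r2 - 1.0
        if delta <= (1.0 + eps) ** 2 - 1.0:
            break
        lam = 0.5 * (1.0 - r2 / gamma2)
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[i_star] += lam
        Ka = [(1.0 - lam) * Ka[i] + lam * K[i][i_star] for i in range(m)]
        # Recursive update of the squared radius (no kernel evaluations).
        r2 *= 1.0 + delta ** 2 / (4.0 * (1.0 + delta))
    return alpha, r2
```

In this sketch the returned `r2` can be compared with the direct evaluation $\Delta^2 - \alpha^T K \alpha$; the two agree up to rounding, while the recursive form avoids recomputing the quadratic term.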

In [41], it has been proved that $\{r_k^2\}$ is a monotonically increasing sequence
with $(r^*)^2$ as an upper bound. Therefore, since the same stopping criterion of
the BC algorithm is used, $\mathcal{I}_S$ identifies an $\epsilon$-coreset $C_S$ of $\varphi(S)$, and the last $B_k$
is a $(1 + \epsilon)$-MEB of $\varphi(S)$. However, the MEB-approximating procedure differs
from that of BC in that the value of $r_k^2$ is not equal to the squared radius of
MEB($C_k$), but tends to the correct value as $\alpha_k$ gets near the optimal solution
(see Fig. 1).

Figure 1: Approximating balls computed by algorithms BC and FW.

The derivation of the MFW method applied to the MEB-SVM problem can
be written down along the same lines. Following the presentation in [41], we
describe the detailed procedure in Algorithm 3.

By now, it should be apparent that $j^*$ is the index identifying the point
nearest to $c_k$, and that it corresponds to the smallest component of $\nabla\Phi(\alpha_k)$.
That is, in Algorithm 3 we consider performing away steps in which the weight
of the nearest point to the current center is reduced. Of course, since the weight
of a point is not allowed to drop below zero, the search for $j^*$ is performed on
$\mathcal{I}_k$ only. Again, the optimal stepsize can be determined in closed form [41]. In
particular, it is easy to see that the expression in step 17 corresponds to

$$\lambda_k = \arg\max_{\lambda \in [0,\, \alpha_{k,j^*} / (1 - \alpha_{k,j^*})]} \Phi\left( (1 + \lambda)\alpha_k - \lambda e_{j^*} \right) , \qquad (31)$$

where the upper bound on the interval preserves dual feasibility.

This kind of step has an intuitive geometrical meaning: if we consider a
solution $\alpha^*$ of the MEB problem, it is known that nonzero components of $\alpha^*$
correspond to points lying on the boundary of the exact MEB. Therefore, it
makes sense to try to remove from the model points that lie near the center
(i.e. far from the boundary of the ball). When an away step is performed,
if $\lambda_k$ is chosen as the supremum of the search interval, we get $\alpha_{k+1,j^*} = 0$
and the corresponding example is removed from the current coreset (drop step).

Moreover, it is not hard to see that step 11 chooses to perform an away step
whenever

$$\nabla\Phi(\alpha_k)^T d_k^{A} > \nabla\Phi(\alpha_k)^T d_k^{FW} . \qquad (32)$$

That is, the choice between FW and away steps is made by choosing the best
ascent direction, exactly as required by the MFW procedure. Here $d_k^{FW} =
(e_{i^*} - \alpha_k)$ and $d_k^{A} = (\alpha_k - e_{j^*})$ denote the search directions of FW and away
steps, respectively. Finally, step 20 shows that, just as with standard FW
steps, after performing an away step we can use an analytical formula to update
$r_k^2$. This expression follows easily by writing the objective function $\Phi(\alpha)$ for
$\alpha = (1 + \lambda)\alpha_k - \lambda e_{j^*}$.

In Fig. 2, we try to give a geometrical insight into the difference between FW
and away steps in terms of search directions.
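
The closed-form away step can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' code: `K` is a precomputed normalized kernel matrix, the quantity `d_minus` is taken to be $1 - \gamma^2(\alpha_k; j^*) / r_k^2$ (a definition chosen here so that comparing it with $\delta_{k+}$ reproduces rule (32)), and it is assumed that $r_k^2 > 0$ and $0 < \alpha_{k,j^*} < 1$. The function name `away_step` is hypothetical.

```python
def away_step(K, alpha, r2):
    """One away step for the MEB dual, with closed-form stepsize.

    Returns (new_alpha, new_r2, j_star). Assumes r2 > 0 and that the
    selected weight alpha[j_star] lies strictly between 0 and 1.
    """
    m = len(K)
    delta2 = K[0][0]
    Ka = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    R = sum(alpha[i] * Ka[i] for i in range(m))
    # Nearest coreset point to the center: largest (K alpha)_j with alpha_j > 0.
    support = [j for j in range(m) if alpha[j] > 0.0]
    j_star = max(support, key=lambda j: Ka[j])
    gamma2 = delta2 + R - 2.0 * Ka[j_star]
    d_minus = 1.0 - gamma2 / r2
    # Unconstrained maximizer of the quadratic in lambda.
    lam_unc = float("inf") if d_minus >= 1.0 else d_minus / (2.0 * (1.0 - d_minus))
    # Largest stepsize keeping alpha[j_star] nonnegative.
    lam_max = alpha[j_star] / (1.0 - alpha[j_star])
    lam = min(lam_unc, lam_max)
    new_alpha = [(1.0 + lam) * a for a in alpha]
    new_alpha[j_star] -= lam
    if lam == lam_max:
        new_alpha[j_star] = 0.0            # drop step
    new_r2 = (1.0 + lam) * r2 - lam * (1.0 + lam) * (1.0 - d_minus) * r2
    return new_alpha, new_r2, j_star
```

As a usage example, take three unit vectors at angles -60, 0 and 60 degrees, so that `K` holds the cosines of the angle differences. Starting from weights `[0.45, 0.1, 0.45]`, the middle point lies strictly inside the ball; the away step is maximal, drops it from the coreset, and yields the exact squared radius 0.75.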






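Algorithm 3 Modified Frank-Wolfe Algorithm for MEB-SVMs
Input: $\mathcal{S}$, $\epsilon$.

 1: initialization: compute $\mathcal{I}_0$ and $\alpha_0$;
 2: $\Delta^2 \leftarrow k(x_1, x_1)$;
 3: $R_0 \leftarrow \sum_{i,j \in \mathcal{I}_0} \alpha_{0,i} \alpha_{0,j} k(x_i, x_j)$;
 4: $r_0^2 \leftarrow \Delta^2 - R_0$;
 5: $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_0; i) := \Delta^2 + R_0 - 2 \sum_{j \in \mathcal{I}_0} \alpha_{0,j} k(x_j, x_i)$;
 6: $j^* \leftarrow \arg\min_{j \in \mathcal{I}_0} \gamma^2(\alpha_0; j)$;
 7: $\delta_{0+} \leftarrow \gamma^2(\alpha_0; i^*) / r_0^2 - 1$;
 8: $\delta_{0-} \leftarrow 1 - \gamma^2(\alpha_0; j^*) / r_0^2$;
 9: $k \leftarrow 0$;
10: while $\delta_{k+} > (1 + \epsilon)^2 - 1$ do
11:   if $\delta_{k+} \geq \delta_{k-}$ then
12:     $\lambda_k \leftarrow \frac{1}{2} \left( 1 - r_k^2 / \gamma^2(\alpha_k; i^*) \right)$;
13:     $k \leftarrow k + 1$;
14:     $\alpha_k \leftarrow (1 - \lambda_{k-1}) \alpha_{k-1} + \lambda_{k-1} e_{i^*}$;
15:     $r_k^2 \leftarrow r_{k-1}^2 \left( 1 + \delta_{k-1+}^2 / (4 (1 + \delta_{k-1+})) \right)$;
16:   else
17:     $\lambda_k \leftarrow \min \left\{ \delta_{k-} / (2 (1 - \delta_{k-})),\ \alpha_{k,j^*} / (1 - \alpha_{k,j^*}) \right\}$;
18:     $k \leftarrow k + 1$;
19:     $\alpha_k \leftarrow (1 + \lambda_{k-1}) \alpha_{k-1} - \lambda_{k-1} e_{j^*}$;
20:     $r_k^2 \leftarrow (1 + \lambda_{k-1}) r_{k-1}^2 - \lambda_{k-1} (1 + \lambda_{k-1}) (1 - \delta_{k-1-}) r_{k-1}^2$;
21:   end if
22:   $\mathcal{I}_k \leftarrow \{i \in \mathcal{I} : \alpha_{k,i} > 0\}$;
23:   $R_k \leftarrow \sum_{i,j \in \mathcal{I}_k} \alpha_{k,i} \alpha_{k,j} k(x_i, x_j)$;
24:   $i^* \leftarrow \arg\max_{i \in \mathcal{I}} \gamma^2(\alpha_k; i) := \Delta^2 + R_k - 2 \sum_{j \in \mathcal{I}_k} \alpha_{k,j} k(x_j, x_i)$;
25:   $j^* \leftarrow \arg\min_{j \in \mathcal{I}_k} \gamma^2(\alpha_k; j)$;
26:   $\delta_{k+} \leftarrow \gamma^2(\alpha_k; i^*) / r_k^2 - 1$;
27:   $\delta_{k-} \leftarrow 1 - \gamma^2(\alpha_k; j^*) / r_k^2$;
28: end while
29: return $\mathcal{I}_S = \mathcal{I}_k$, $\alpha = \alpha_k$.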



We previously hinted at the linear convergence properties of MFW. This
result can now be stated more precisely [41].

Proposition 1. At each iteration of the MFW algorithm, we have:

$$\frac{\Phi(\alpha^*) - \Phi(\alpha_{k+1})}{\Phi(\alpha^*) - \Phi(\alpha_k)} \leq M , \qquad (33)$$

where $M < 1$ is bounded in terms of a constant $\beta$ and of $d_S = \mathrm{diam}(S)^2$
(see [41] for the explicit expression).

Figure 2: A sketch of the search directions used by FW and MFW.



Substantially, as shown in the convergence analysis of [13], there exists a 
point in the optimization path of the MFW algorithm after which only away 
steps are performed. That is, the algorithm only needs to remove useless exam- 
ples to correctly identify the optimal support vector set. From this stage on, 
the algorithm converges linearly to the optimum value of the objective function. 
In contrast, the standard FW algorithm does not possess the explicit ability to 
eliminate spurious patterns from the index set, and tends to slow down when 
getting near the solution. 



4.4 Beyond Normalized Kernels 



The methods studied in this paper were originally motivated by recent advances
in computational geometry that led to efficient algorithms to address the MEB
problem [41]. However, a major advantage of the proposed methods over, e.g.,
the CVM approach is that both the theory and the implementation of our
algorithms can be applied without changes to train SVMs using kernels which
do not satisfy condition (17), imposed to obtain the equivalence between the
MEB problem (15) and the SVM optimization problem (12).

Both the FW and MFW methods were designed to maximize any differentiable
concave function $f(\alpha)$ in a bounded convex polyhedron. The objective
function in the SVM problem (12) is concave and the set of constraints coincides
with the unit simplex. The proposed methods can thus be applied directly to
solve (12) without regard to (15). Theoretical results such as the global
convergence of the algorithms still hold. In addition, since strict complementarity usually
holds for SVM problems, results in [13] imply that MFW still converges linearly
to the optimum. Note also that the constant $\Delta^2$, which makes the difference
between (15) and (12) for normalized kernels, can still be added to the objective
function of (12) in the case of non-normalized kernels, since it is simply ignored
when optimizing for $\alpha$. An implementation designed to handle normalized
kernels can thus be directly used with any Mercer kernel.

It is apparent that the geometrical interpretation underlying Algorithms 2
and 3 needs to be reformulated if the SVM problem is no longer equivalent to
the problem of computing a MEB. However, it is easy to show that the search
direction of the FW procedure at iteration $k$ is still $e_{i^*}$, where the index $i^*$
corresponds to the largest component of $\nabla f(\alpha_k)$. Similarly, the away direction
explored by MFW at iteration $k$ is still $e_{j^*}$, where the index $j^*$ corresponds to the
smallest component of $\nabla f(\alpha_k)$. The set of constraints in problem (12) coincides
with that of (15). In addition, any approximate solution $\alpha_k$ produced by the
proposed algorithms is feasible. Thus, the sequence $f(\alpha_k)$ is strictly increasing
and converges from below to the optimum $f(\alpha^*)$. It is not immediately evident,
however, whether the stopping condition used within our algorithms guarantees
that the method finds a solution in a neighborhood of the optimum $\alpha^*$. We now
show that this is indeed the case. For simplicity of notation, it is convenient to
write explicitly the target QP of MEB-SVMs in matrix form:



$$\begin{array}{ll}
\text{maximize}_{\alpha} & g(\alpha) := \Delta^2 - \alpha^T K \alpha = \Delta^2 - \|c\|^2 \\
\text{subject to} & e^T \alpha = 1 ,\ \alpha \geq 0 ,
\end{array} \qquad (34)$$

where $K$ is the matrix of entries $k_{ij} = \varphi(x_i)^T \varphi(x_j) = y_i y_j k(x_i, x_j) + y_i y_j + \delta_{ij}/C$,
$c = Z\alpha$, and $Z$ is the matrix whose columns are the feature vectors $z_i = \varphi(x_i)$.
Note that $K = Z^T Z$. When $k$ is a normalized kernel, we get $\Phi(\alpha) = g(\alpha)$.
For non-normalized kernels, instead, $\Delta^2$ can be viewed as an arbitrary constant
added to the SVM objective function in (12), $\Theta(\alpha) = -\alpha^T K \alpha$. That is, we can
always think of $g(\alpha)$ as the objective function when solving (12).

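It is not hard to see that the stopping condition used in Algorithms 2 and 3
can be written as follows:

$$\begin{array}{rl}
 & \delta_{k+} \leq (1 + \epsilon)^2 - 1 \\
\Leftrightarrow & \|z_{i^*} - c_k\|^2 \leq (1 + \epsilon)^2 r_k^2 \\
\Leftrightarrow & \Delta^2 - 2 z_{i^*}^T c_k + \|c_k\|^2 \leq (1 + \epsilon)^2 r_k^2 .
\end{array} \qquad (35)$$

Since by construction $r_k^2 = \Delta^2 - \|c_k\|^2$, we get

$$\begin{array}{rl}
 & \Delta^2 - 2 z_{i^*}^T c_k + \|c_k\|^2 \leq (1 + 2\epsilon + \epsilon^2)(\Delta^2 - \|c_k\|^2) \\
\Leftrightarrow & \Delta^2 - 2 z_{i^*}^T c_k + \|c_k\|^2 \leq (1 + \tilde{\epsilon})(\Delta^2 - \|c_k\|^2) \\
\Leftrightarrow & \Delta^2 - 2 z_{i^*}^T c_k + \|c_k\|^2 \leq \Delta^2 - \|c_k\|^2 + \tilde{\epsilon}\, g(\alpha_k) \\
\Leftrightarrow & -2 z_{i^*}^T c_k + 2\|c_k\|^2 \leq \tilde{\epsilon}\, g(\alpha_k) ,
\end{array} \qquad (36)$$

with $\tilde{\epsilon} = 2\epsilon + \epsilon^2 = O(\epsilon)$. Now, since $\nabla g(\alpha_k) = -2K\alpha_k = -2Z^T c_k$, we
have that $\nabla g(\alpha_k)_{i^*} = -2 z_{i^*}^T c_k$. In addition, $\alpha_k^T \nabla g(\alpha_k) = -2\alpha_k^T K \alpha_k =
-2\alpha_k^T Z^T Z \alpha_k = -2 c_k^T c_k = -2\|c_k\|^2$. Thus, the stopping condition for both
algorithms is equivalent to

$$\nabla g(\alpha_k)_{i^*} - \alpha_k^T \nabla g(\alpha_k) \leq \tilde{\epsilon}\, g(\alpha_k) . \qquad (37)$$

On the other hand, since the objective function $g(\alpha)$ is concave and
differentiable,

$$g(\alpha^*) \leq g(\alpha_k) + (\alpha^* - \alpha_k)^T \nabla g(\alpha_k) . \qquad (38)$$

In addition, $e^T \alpha^* = 1$ with $\alpha^* \geq 0$, and thus $\alpha^{*T} \nabla g(\alpha_k) \leq \max_{i \in \mathcal{I}} \nabla g(\alpha_k)_i = \nabla g(\alpha_k)_{i^*}$.
Therefore,

$$\begin{array}{rl}
g(\alpha^*) & \leq g(\alpha_k) + \alpha^{*T} \nabla g(\alpha_k) - \alpha_k^T \nabla g(\alpha_k) \\
 & \leq g(\alpha_k) + \nabla g(\alpha_k)_{i^*} - \alpha_k^T \nabla g(\alpha_k) .
\end{array} \qquad (39)$$

By virtue of (37) and (39), we obtain that

$$g(\alpha^*) \leq g(\alpha_k) + \tilde{\epsilon}\, g(\alpha_k) = (1 + \tilde{\epsilon})\, g(\alpha_k) . \qquad (40)$$

Finally, from the feasibility of $\alpha_k$, we have $g(\alpha_k) \leq g(\alpha^*)$. Therefore,
using $1/(1 + \tilde{\epsilon}) \geq 1 - \tilde{\epsilon}$ for $g(\alpha^*) \geq 0$,

$$(1 - \tilde{\epsilon})\, g(\alpha^*) \leq g(\alpha_k) \leq g(\alpha^*) , \qquad (41)$$

that is, Algorithms 2 and 3 stop with an objective function value $g(\alpha_k)$ in a left
neighborhood of radius $g(\alpha^*)\tilde{\epsilon} = g(\alpha^*)(2\epsilon + \epsilon^2) = O(\epsilon)$ around the optimum,
even if the target problem (12) is not equivalent to a MEB problem.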
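
The argument above can be checked numerically on a toy problem. The sketch below is only illustrative: it assumes a precomputed matrix `K` and a constant `delta2` playing the role of $\Delta^2$, and the helper names `objective`, `fw_gap` and `stops` are hypothetical. It evaluates the left-hand side of (37) and the stopping test, so that the bounds (39) and (41) can be verified for a given feasible point.

```python
def objective(K, alpha, delta2):
    """g(alpha) = Delta^2 - alpha^T K alpha."""
    m = len(alpha)
    quad = sum(alpha[i] * K[i][j] * alpha[j]
               for i in range(m) for j in range(m))
    return delta2 - quad


def fw_gap(K, alpha):
    """Left-hand side of (37): max_i grad_i - alpha^T grad, grad = -2 K alpha."""
    m = len(alpha)
    grad = [-2.0 * sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    return max(grad) - sum(alpha[i] * grad[i] for i in range(m))


def stops(K, alpha, delta2, eps):
    """Stopping test in the form (37), with eps_tilde = 2*eps + eps^2."""
    eps_tilde = 2.0 * eps + eps ** 2
    return fw_gap(K, alpha) <= eps_tilde * objective(K, alpha, delta2)
```

For the 3 x 3 identity matrix with `delta2 = 1`, the optimum is the uniform vector with value 2/3. The point `[0.34, 0.33, 0.33]` passes the test with `eps = 0.1`, and its objective value indeed lies between $(1 - \tilde{\epsilon}) \cdot 2/3$ and $2/3$, whereas `[0.5, 0.5, 0.0]` does not pass.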



5 Experiments 

We test all the classification methods discussed above on several classification
problems. Our aim is to show that, as long as a minor loss in accuracy is
acceptable, Frank-Wolfe based methods are able to build L2-SVM classifiers in
considerably less time than CVMs, which in turn have been shown
in [37] to be faster than most traditional SVM software. This is especially
evident on large-scale problems, where the capability to construct a classifier in
a significantly reduced amount of time may be most useful.

5.1 Organization of this Section 

After discussing several implementation issues, we compare the performance of
the studied algorithms on several classical datasets. Our experiments include
scalability tests on two different collections of problems of increasing size, which
assess the capability of Frank-Wolfe based methods to efficiently solve problems
of increasingly large size. These results can be found in Subsections 5.3 and 5.4. In
Subsection 5.5 we present additional experiments on the set of problems studied
in [10]. The statistical significance of the results presented so far is analyzed
in Subsection 5.6. A separate test is then performed in Subsection 5.7 to study
the influence of the penalty parameter $C$ on each training algorithm. Finally,
in Subsection 5.8 we present some experiments showing the capability of FW
and MFW methods to handle a wider family of kernel functions with respect to
CVMs. We highlight that the purpose of that paragraph is not to improve the
accuracy or the training time of the algorithms. A detailed commentary on the
obtained results, which summarizes and expands on our conclusions, closes the
section in Subsection 5.9.



5.2 Datasets and Implementation Issues 

As we detail below, all the datasets used in this section have been widely used
in the literature. They were selected to cover a large variety with respect to the
number of instances, number of dimensions and number of classes. In most
cases, the training and testing sets are standard (precomputed for benchmarking)
and can be obtained from public repositories like [3], [15], or others we
indicate in the dataset descriptions. The exceptions to this rule are the datasets
Pendigits and KDD99-10pc. In these cases, the testing set was obtained
by randomly sampling 20% of the items from the original collection. All the
examples not selected as test instances were employed for training.

For each problem, we specify, in Tab. 1, the number $m$ of training points,
the input space dimension $n$, and the number of classes $K$. We indicate by $t$
the number of examples in the test set, which is used to evaluate the accuracy
of the classifiers but never employed for training or parameter tuning. In the
case of multi-category classification problems, we adopted the one-versus-one
approach (OVO), which is the method used in [37] to extend CVMs beyond
binary classification and that usually obtains the best performance both in terms
of accuracy and training time according to [17]. Hence, for these cases we also
report the size $m_{max}$ of the largest binary subproblem and the size $m_{min}$ of the
smallest binary subproblem in the OVO decomposition.

Here follow some brief descriptions of the pattern recognition problems
underlying each dataset, taken from their respective sources.

• USPS, USPS-Ext - The USPS dataset is a classic handwritten digit
recognition problem, where the patterns are 16 x 16 images from United
States Postal Service envelopes. The extended version USPS-Ext first
appeared in [37] to show the large-scale capabilities of CVMs. The original
version can be downloaded from [3] and the extended one from [36].

• Pendigits - Another digit recognition dataset, created by experimentally
collecting samples from a total of 44 writers with a tablet and a stylus.
This dataset can be obtained from [15].

• Letter - An Optical Character Recognition (OCR) problem. The objective
is to identify each of a large number of black-and-white rectangular
pixel displays as one of the 26 capital letters in the English alphabet. The
files can be obtained from [3].

• Protein - A bioinformatics problem regarding protein structure prediction.
This dataset can be downloaded from [3].

• Shuttle - This is a dataset in the Statlog collection, originated from NASA
and concerning the position of radiators within the Space Shuttle [27]. The
dataset can be obtained from [15] or [55].

• IJCNN - A dataset from the 2001 neural network competition of the
International Joint Conference on Neural Networks. We obtained this
dataset from 157 .

• MNIST - Another classic handwritten digit recognition problem, this
time coming from National Institute of Standards (NIST) data. The
dataset can be obtained from [55].

• KDD99-10pc, KDD99-Full - This is a dataset used in the 1999 Knowledge
Discovery and Data Mining Cup. The data are connection records
for a network, obtained by simulating a wide variety of normal accesses
and intrusions on a military network. The problem is to detect different
types of accesses on the network with the aim of identifying fraudulent
ones. The 10pc version is a randomly selected 10% of the whole data.

• Reuters - A text categorization problem built from a collection of documents
that appeared on Reuters newswire in 1987. The documents were
assembled and indexed with categories. The binary version used in this
paper (relevant versus non-relevant documents) was obtained from [36].

• Adult a1a-a8a - A series of problems derived from a dataset extracted
from the 1994 US Census database. The original aim was to predict
whether an individual's income exceeded 50000 US$/year, based on personal
data. All the instances of this collection can be downloaded from
m-

• Web w1a-w8a - A series of problems extracted from a web classification
task dataset, which first appeared in Platt's paper on Sequential Minimal
Optimization for training SVMs [30]. All the instances of this collection can
be downloaded from [36].

Table 1: Features of the selected datasets.



5.2.1 SVM Parameters 



For the experiments presented in Subsection 5.3 to Subsection 5.7, SVMs were
trained using an RBF kernel $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / (2\sigma^2))$. The reason for
this choice is that this kernel is the best-known in the family of kernels admitted
by CVMs and it is frequently used in practice [37]. In particular, this is the
choice for the large set of experiments presented in [37] to demonstrate the
advantage of this framework over other SVM software. However, in Subsection
5.8 we present some results showing the capability of FW and MFW methods
to handle a polynomial kernel, which does not satisfy the conditions required
by CVMs.

For the relatively small datasets Pendigits and USPS, parameter $\sigma^2$ was
determined together with parameter $C$ of SVMs using 10-fold cross-validation
on the logarithmic grid [2~^^, 2^] x [2~^, 2^^], where the first collection of values
corresponds to parameter $\sigma^2$ and the second to parameter $C$. For the large-scale
datasets, $\sigma^2$ was determined using the default method employed by CVMs in
[37], that is, it was set to the average squared distance among training patterns.
Parameter $C$ was determined on the logarithmic grid [2°, 2^^] using a validation
set consisting of a randomly selected 30% fraction of the training set.
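
As an illustration of the default choice of the kernel width, the following sketch computes the average squared distance among a set of training patterns and plugs it into the RBF kernel. It is a simplified example rather than the code used in the experiments: the average is taken here over distinct pairs (an assumption, since the exact averaging convention is not spelled out above), and the function names are hypothetical.

```python
import math


def mean_squared_distance(X):
    """Average squared Euclidean distance over distinct pairs of patterns."""
    m = len(X)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            total += sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
    return total / (m * (m - 1) / 2)


def rbf_kernel(x1, x2, sigma2):
    """RBF kernel k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2.0 * sigma2))


X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
sigma2 = mean_squared_distance(X)   # (4 + 4 + 8) / 3
print(sigma2, rbf_kernel(X[0], X[1], sigma2))
```

Since $k(x, x) = 1$ for every pattern, this kernel is normalized, which is the property the MEB formulation relies on. The double loop is quadratic in the number of patterns, so on large datasets one would presumably estimate the average on a random subsample.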

We stress that the aim of this paper is not to determine an optimal value of
the parameters by fine-tuning each algorithm on the test problems to seek
the best possible accuracy. As our intent is to compare the performance of the
presented methods and analyze their behavior in a manner consistent with our
theoretical analysis, it is necessary to perform the experiments under the same
conditions on a given dataset. That is to say, the optimization problem to be
solved should be the same for each algorithm. For this reason, we deliberately
avoided using different training parameters when comparing different methods.
Specifically, parameters $\sigma^2$ and $C$ were tuned using the CVM method and the
obtained values were used for all the algorithms discussed in this paper (CVM,
FW and MFW).

Furthermore, since the value of parameter $C$ can have a significant influence
on the running times, we devote a specific subsection to evaluating the effect of
this parameter on the different training algorithms.



5.2.2 MEB Initialization and Parameters 

As regards the initialization of the CVM, FW and MFW methods, that is, the
computation of $\mathcal{I}_0$ and $\alpha_0$ in Algorithms 1, 2 and 3, we adopted the random
MEB method described in the previous sections, using $p = 20$ points. As
suggested in |37], we used e = 10~^ with all the algorithms. 



5.2.3 Random Sampling Techniques 



Computing $i^*$, i.e. evaluating (20) for all of the $m$ training points, requires
a number of kernel evaluations of order $O(q_k^2 + m q_k) = O(m q_k)$, where $q_k$ is
the cardinality of $\mathcal{I}_k$. If $m$ is very large, this complexity can quickly become
unacceptable, ruling out the possibility of solving large-scale classification
problems in a reasonable time. A sampling technique, called probabilistic speedup,
was proposed in [34] to overcome this obstacle. In practice, the distance (20) is
computed just on a random subset $\varphi(S') \subset \varphi(S)$, where $S'$ is identified by an
index set $\mathcal{I}'$ of small constant cardinality $r$. The overall complexity is thereby
reduced to order $O(q_k^2 + q_k) = O(q_k^2)$, a major improvement on the previous
estimate, since we generally have $q_k \ll m$. The main result this technique relies
on is the following [55].

Theorem 1. Let $D := \{d_1, \ldots, d_m\} \subset \mathbb{R}$ be a set of cardinality $m$, and let
$D' \subset D$ be a random subset of size $r$. Then the probability that $\max D'$ is
greater than or equal to $\tilde{m}$ elements of $D$ is at least $1 - (\tilde{m}/m)^r$.

For example, if $r = 59$ and $\tilde{m} = 0.95m$, then with probability at least 0.95
the point in $\varphi(S')$ farthest from the center lies among the 5% of the farthest
points in the whole set $\varphi(S)$. This is the choice originally made in [37] and used
in [10] to test the CVM and FW algorithms.
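
The numbers in this example follow directly from the bound in Theorem 1. As a quick check, the sketch below (with hypothetical helper names) evaluates $1 - (\tilde{m}/m)^r$ and finds the smallest sample size that reaches a prescribed confidence.

```python
import math


def farthest_point_confidence(r, fraction):
    """Lower bound 1 - fraction**r of Theorem 1, with fraction = m_tilde / m."""
    return 1.0 - fraction ** r


def smallest_sample_size(fraction, confidence):
    """Smallest r such that 1 - fraction**r >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(fraction))


print(farthest_point_confidence(59, 0.95))   # slightly above 0.95
print(smallest_sample_size(0.95, 0.95))      # 59
```

This shows why 59 is a natural choice: it is the smallest sample size for which the bound guarantees, with probability at least 0.95, a point among the farthest 5%, and it does not depend on $m$.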



5.2.4 Caching 

We also adopted the LRR caching strategy designed in [Mj for CVMs to avoid 
the computation of recently used kernel values. 



5.2.5 Computational Environment 

The experiments were conducted on a personal computer with a 2.66GHz Quad 
Core CPU and 4 GB of RAM, running 64bit GNU/Linux (Ubuntu 10.10). The 
algorithms were implemented based on the (C++) source code available at

5.3 Scalability Experiments on the Web Dataset Collection

In Fig. 3, we report some results concerning accuracies, training times, speed-ups
and support vector set sizes obtained in the Web datasets. The series
is monotonically increasing in the number of training patterns, which grows
approximately as $m_i \approx 1.4^i m_0$, $i = 1, \ldots, 8$, where $m_0$ is the number of training
patterns in the first dataset [30].

The speed-up of the FW method with respect to CVMs is measured as
$s = t_0 / t_1$, where $t_0$ is the training time of the CVM algorithm and $t_1$ is the
training time of the FW method, both measured in seconds. Similarly, the
speed-up of the MFW method with respect to CVMs is measured as $s = t_0 / t_2$,
where $t_2$ is the training time of the MFW method.

As depicted in Fig. 3, the proposed methods are slightly less accurate than
CVMs. The training time, in contrast, scales considerably better for our methods
as the number of training patterns increases. The speed-ups are actually
always greater than 1, which shows that the FW and MFW methods indeed
build classifiers faster than CVMs. More importantly, the speed-up is
monotonically increasing, ranging from 12 times faster up to 107 times faster in the
case of the FW algorithm and from 2 times faster up to almost 10 times faster
in the case of the MFW method. This suggests that the improvement of the
proposed methods over CVMs becomes more and more significant as the size of
the training set grows.
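
For concreteness, the speed-ups at the two ends of the Web series can be recomputed from the mean running times reported in Tab. 3 for w1a and w8a. The snippet below is only an arithmetic check; the dictionary layout and the function name are illustrative.

```python
def speed_up(t_cvm, t_method):
    """Speed-up s = t_0 / t_i of a method with respect to CVM."""
    return t_cvm / t_method


# Mean running times in seconds, taken from Tab. 3.
times = {
    "w1a": {"CVM": 3.59, "FW": 0.286, "MFW": 1.67},
    "w8a": {"CVM": 1030.0, "FW": 9.64, "MFW": 111.0},
}

for name, t in times.items():
    s_fw = speed_up(t["CVM"], t["FW"])
    s_mfw = speed_up(t["CVM"], t["MFW"])
    print(name, round(s_fw, 1), round(s_mfw, 1))
```

The resulting ratios are roughly 12.6 and 2.1 for w1a and 106.8 and 9.3 for w8a, in line with the ranges quoted above.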

5.4 Scalability Experiments on the Adult Dataset Collection

Fig. 4 depicts accuracies and speed-ups obtained in the Adult datasets. Like
the Web datasets, this collection was created with the purpose of analyzing
the scalability of SVM methods and the number of training patterns grows
approximately at the same rate [30]. The speed-up of the FW and MFW
methods is computed as in the previous section.

Results obtained in this experiment confirm that the proposed methods tend
to be faster than CVMs as the dataset grows. CVMs are actually faster than
FW in just two cases, corresponding to the smallest versions of the sequence.
MFW, however, always runs faster than CVMs, reaching a speed-up of 15x in
the fifth version of the series. The speed-ups obtained by the FW method are
more moderate in this experiment than in the Web collection. However, most of
the time FW also exhibits better test accuracies than CVMs. Finally, the MFW
algorithm is not only faster but also as accurate as CVMs on this classification
problem.

5.5 Experiments on Single Datasets 

Results of Figs. 5 and 6 correspond to accuracies and speed-ups obtained in the
single datasets described in Tab. 1, that is, all of them but the Web and Adult
series. Most of these problems have already been used to show the improvements
of CVMs over other algorithms to train SVMs.

Results show that the proposed methods are faster than CVMs most of
the time, sometimes at the price of a slightly lower accuracy. The speed-up
achieved by FW and MFW becomes more significant as the size of the training
set grows. FW in particular reaches peaks of 102x in the second largest dataset
(KDD99-10pc) and 25x in the largest of the problems studied in this experiment
(KDD99-Full). Finally, the results show that in the cases for which
CVMs are faster, the advantage of this algorithm over FW and MFW tends to
be very small, reaching its best speed-up against MFW in the USPS-Ext
dataset, for which however FW exhibits a speed-up of around 3x.



Dataset      CVM                    FW                     MFW
             Acc.    STD.           Acc.    STD.           Acc.    STD.
a1a          83.52   6.33E-003      83.52   7.53E-003      83.52   1.58E-003
a2a          83.55   1.15E-002      83.56   2.39E-002      83.56   9.19E-003
a3a          83.40   8.45E-003      83.39   4.17E-002      83.41   5.00E-003
a4a          84.06   1.59E-002      84.10   4.64E-002      84.07   1.15E-002
a5a          84.02   8.92E-003      84.03   2.93E-002      84.05   9.86E-003
a6a          84.28   3.51E-003      84.26   2.49E-002      84.25   1.86E-002
a7a          84.18   1.06E-002      84.23   4.41E-002      84.29   2.26E-002
w1a          97.80   2.81E-003      97.31   8.54E-002      97.65   2.97E-001
w2a          98.06   1.37E-003      97.42   2.58E-001      97.80   2.93E-001
w3a          98.21   1.67E-003      97.31   6.48E-002      97.60   4.83E-001
w4a          98.28   9.44E-004      97.46   2.17E-001      97.91   4.79E-001
w5a          98.39   2.01E-003      97.50   2.71E-001      98.36   1.76E-002
w6a          98.72   5.63E-003      97.40   3.45E-001      98.43   4.54E-001
w7a          98.73   5.29E-003      97.43   2.43E-001      97.88   6.60E-001
w8a          99.34   5.35E-003      97.80   2.82E-001      97.59   4.32E-001
Letter       97.48   2.19E-002      96.54   1.37E-001      97.35   1.50E-001
Pendigits    98.35   9.46E-002      97.68   9.39E-002      97.65   1.22E-001
USPS         95.63   3.73E-002      95.12   1.05E-001      95.47   8.34E-002
Reuters      97.10   4.11E-002      96.40   1.53E-001      95.60   6.13E-001
MNIST        98.46   3.14E-002      97.91   5.99E-002      98.36   4.05E-002
Protein      69.79   0.00E+00       69.73   0.00E+00       69.78   0.00E+00
Shuttle      99.67   1.51E-001      98.08   6.74E-001      97.82   1.54E+00
IJCNN        98.59   4.89E-002      95.71   7.95E-001      97.31   3.63E-001
USPS-Ext     99.50   1.26E-002      99.30   5.47E-002      99.57   2.76E-002
KDD10pc      99.87   2.06E-002      98.82   [illegible]    99.10   2.86E-001
KDD-full     [illegible] [illegible] [illegible] [illegible] 91.82   7.72E-002

Table 2: Test accuracy (%) of the proposed algorithms and the baseline method
CVM. Statistics correspond to the mean (Acc) and standard deviation (STD)
obtained from 5 repetitions of each experiment. For the Protein dataset, just 1
repetition was carried out due to the significantly longer training times.

Dataset      CVM                        FW                         MFW
             Time         STD.          Time         STD.          Time         STD.
a1a          6.26         [illegible]   12.5         6.99E-01      0.712        [illegible]
a2a          16           [illegible]   19.3         1.51E+00      1.46         1.41E-02
a3a          [illegible]  1.74E-01      26.8         1.44E+00      2.78         [illegible]
a4a          [illegible]  [illegible]   40.4         1.19E+00      6.37         [illegible]
a5a          [illegible]  [illegible]   56.1         1.94E+00      11.1         [illegible]
a6a          [illegible]  [illegible]   164          6.24E+00      45           [illegible]
a7a          [illegible]  [illegible]   1420         5.86E+01      [illegible]  [illegible]
w1a          3.59         [illegible]   0.286        8.26E-02      1.67         [illegible]
w2a          7.9          [illegible]   0.658        3.32E-01      2.31         [illegible]
w3a          12.8         [illegible]   0.76         5.90E-02      2.27         [illegible]
w4a          [illegible]  [illegible]   1.26         5.01E-01      [illegible]  [illegible]
w5a          [illegible]  [illegible]   1.78         1.07E+00      [illegible]  [illegible]
w6a          [illegible]  [illegible]   2.67         2.36E+00      29.4         [illegible]
w7a          215          1.17E+01      2.58         [illegible]   91.4         1.31E+02
w8a          1030         6.05E+01      9.64         5.22E+00      111          1.27E+02
Letter       23.7         2.80E-01      13.3         2.05E-01      12.3         1.42E-01
Pendigits    0.554        3.26E-02      0.82         2.97E-02      0.658        2.23E-02
USPS         6.89         7.46E-02      7.58         1.42E-01      7.22         9.00E-02
Reuters      7.24         3.62E-01      2.17         3.87E-01      1.69         5.87E-01
MNIST        364          1.31E+01      301          8.56E+00      349          2.59E+00
Protein      247000       0.00E+00      11900        0.00E+00      2000         0.00E+00
Shuttle      1.41         3.41E-01      1.69         4.56E-01      0.176        2.73E-02
IJCNN        198          1.36E+01      40.5         2.27E+01      34.4         1.26E+01
USPS-Ext     84.4         2.02E+01      26.7         3.74E+00      161          1.49E+01
KDD10pc      42.3         3.58E+00      0.414        1.50E-02      1.22         1.24E+00
KDD-full     19.5         8.12E+00      0.764        2.42E-02      0.744        8.00E-03

Table 3: Running times (seconds) of the proposed algorithms and the baseline
method CVM. Statistics correspond to the mean (Time) and standard deviation
(STD) obtained from 5 repetitions of each experiment. For the Protein dataset,
just 1 repetition was carried out due to the significantly longer training times.



For the sake of readability, we include in Tab. 2 and Tab. 3 a summary of the 
test accuracies and running times used to build Figs. 3 to 6. 

Figure 3: Comparison of accuracies (first row), speed-ups (second row), absolute 
running times (third row) and sizes of training sets and support vector sets 
(fourth row) in the Web datasets. 

Figure 4: Comparison of accuracies (first row), speed-ups (second row), absolute 
running times (third row) and sizes of training sets and support vector sets 
(fourth row) in the Adult datasets. 

Figure 5: Comparison of accuracies (first row) and speed-ups (second row) 
obtained in some of the single datasets of Tab. 1. 





Figure 6: Comparison of accuracies (first column) and speed-ups (second 
column) obtained in some of the single datasets of Tab. 1. 

5.6 Statistical Tests 



This subsection is devoted to verifying the statistical significance of the 
results obtained above. To this end we adopt the guidelines suggested in [7]: 
we first conduct a multiple test to determine whether the hypothesis that all the 
algorithms perform equally can be rejected. Then, we conduct separate 
binary tests to compare the performance of each algorithm against each other. 
For the binary tests we adopt the Wilcoxon Signed-Ranks Test. For 
the multiple test we use the non-parametric Friedman Test. In [7], Demšar 
recommends these tests as safe alternatives to the classical parametric t-tests 
for comparing classifiers over multiple datasets. 

The main hypothesis of this paper is that our algorithms are faster than 
CVM. We have also observed that they are slightly less accurate. Therefore, 
our design for the binary tests between our algorithms and CVM is that of 
Tab. 4. As regards the comparison of the proposed methods, there is no 
apparent advantage in terms of running time of one over the other. MFW 
seems however more accurate than FW. We thus conduct a two-tailed test for 
the running times but adopt a one-tailed test for accuracy. 
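
The two statistics used in this subsection are simple to compute from 
per-dataset measurements. The following is a minimal sketch, not the procedure 
used to produce the results reported here. It assumes the convention 
W = min(R+, R-) for the signed-rank statistic, with Pratt's treatment of zero 
differences, and the Iman and Davenport form of the Friedman statistic. Other 
software may report a different but equivalent quantity for W. The function 
names and the small `times` matrix are hypothetical and purely illustrative. 

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of the 1-based positions in the tie
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_w(a, b):
    """Signed-rank statistic W = min(R+, R-), Pratt's handling of zero differences."""
    diffs = [x - y for x, y in zip(a, b)]
    ranks = average_ranks([abs(d) for d in diffs])  # zeros take part in the ranking
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)  # ranks of zeros are dropped
    return min(r_plus, r_minus)


def friedman_iman_davenport(scores):
    """scores[i][j] is the measurement of algorithm j on dataset i (lower is better).

    Returns (chi2_F, F_F). F_F is undefined when all datasets rank the
    algorithms identically, since the denominator is then zero.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


# Synthetic running times (seconds), NOT taken from the paper.
# Columns: CVM, FW, MFW; one row per dataset.
times = [
    [10.0, 2.0, 3.0],
    [20.0, 5.0, 4.0],
    [5.0, 6.0, 1.0],
    [8.0, 1.0, 2.0],
    [30.0, 3.0, 3.5],
]
cvm = [row[0] for row in times]
fw = [row[1] for row in times]

w_stat = wilcoxon_w(fw, cvm)
chi2_f, f_f = friedman_iman_davenport(times)
print(w_stat, chi2_f, f_f)
```

With N datasets and k algorithms, the corrected statistic F_F is compared 
against the F distribution with k-1 and (k-1)(N-1) degrees of freedom. 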





FW vs. CVM     Time       H0: FW and CVM are equally fast
                          H1: FW is faster than CVM
               Accuracy   H0: FW and CVM are equally accurate
                          H1: CVM is more accurate than FW
MFW vs. CVM    Time       H0: MFW and CVM are equally fast
                          H1: MFW is faster than CVM
               Accuracy   H0: MFW and CVM are equally accurate
                          H1: CVM is more accurate than MFW
FW vs. MFW     Time       H0: FW and MFW are equally fast
                          H1: Running times of MFW and FW are different
               Accuracy   H0: FW and MFW are equally accurate
                          H1: FW is less accurate than MFW



Table 4: Null and alternative hypotheses for the binary statistical tests. 



In Tab. 5 we report the values of the test statistics calculated on the 
26 datasets used in this paper. The critical values for rejection of the null 
hypothesis under a given significance level can be found in several books 
[35]. Here, in Tab. 6, we report the p-values corresponding to each test.* 

Note that in all but one case (the binary test FW vs. MFW on running time) 
the p-values are lower than 0.01. Therefore, for the most commonly used 
significance levels (0.01, 0.05 and 0.1) we conclude that there are significant 
differences in terms of time and accuracy among the algorithms. Table 7 
summarizes the conclusions from the binary tests. Note that the main hypothesis 
of this paper is confirmed. Most of the time our algorithms run faster than CVM 
and are less accurate. In the previous sections we have seen however that the 
loss in accuracy is usually lower than 1%, while the running time can be orders 
of magnitude better. As regards the comparison of the proposed algorithms FW 
and MFW, we cannot conclude that the difference in training time is 
statistically significant. However, we conclude that MFW is more accurate than 
FW. This last observation stresses the relevance of this work as an extension 
of the results presented in [10]. 

* For reproducibility, p-values were computed using the statistical software R [31]. 
For the Wilcoxon Signed-Ranks Test, the exact p-values were preferred to the asymptotic ones. 
The Pratt method to handle ties is employed by default. In the case of the Friedman test, 
Iman and Davenport's correction was adopted, as suggested in [7]. 


            W statistic                              F statistic
            FW vs. CVM   MFW vs. CVM   FW vs. MFW    FW, MFW, CVM
Time        17           20            159           14.858
Accuracy    19           48.5          63.5          11.879

Table 5: Values of the W and F statistics for Wilcoxon Signed-Ranks Tests 
and Friedman Tests respectively. 


            Binary Tests                             Multiple Tests
            FW vs. CVM   MFW vs. CVM   FW vs. MFW    FW, MFW, CVM
Time        3.085e-06    5.528e-06     0.6893        1.72e-05
Accuracy    4.977e-06    3.21e-04      1.23e-03      1.20e-04

Table 6: P-values corresponding to the statistical tests. 





FW vs. CVM     Time       H0 rejected, so H1: FW is faster than CVM
               Accuracy   H0 rejected, so H1: CVM is more accurate than FW
MFW vs. CVM    Time       H0 rejected, so H1: MFW is faster than CVM
               Accuracy   H0 rejected, so H1: CVM is more accurate than MFW
FW vs. MFW     Time       We cannot reject H0: FW and MFW are equally fast
               Accuracy   H0 rejected, so H1: FW is less accurate than MFW



Table 7: Conclusions from the binary statistical tests for significance levels 
0.01, 0.05, 0.1. 



5.7 Experiments on the parameter C 

Previous experiments have shown that parameter C used by SVMs to handle 
noisy patterns can have a significant impact on the training time required to 
build the classifier [10]. We hence conduct experiments on some datasets to 
study this effect in more detail. 

Figs. 7 and 8 show the training times and accuracies obtained on the 
Shuttle, KDD99-10pc, Pendigits and Reuters datasets when changing the value 
of C. The results confirm the general effect of this parameter on the training 
time: as C grows, all the algorithms become faster. However, the training times 
of the proposed methods are most of the time significantly lower than those of 
CVMs, independently of the value of parameter C used by the SVM. 

Figure 7: Test accuracies (first row) and training times (second row) obtained 
while changing the value of C in the Shuttle and KDD99-10pc datasets. 

Figure 8: Test accuracies (first row) and training times (second row) obtained 
while changing the value of C in the Pendigits and Reuters datasets. 

5.8 Experiments with Non-Normalized Kernels 

Solving a classification problem using SVMs requires selecting a kernel function. 
Since the optimal kernel for a given application cannot be specified a priori, the 
capability of a training method to work with any (or the widest possible) family 
of kernels is an important feature. 

In order to show that the proposed methods can obtain effective models 
even if the kernel does not satisfy the conditions required by CVMs, we conduct 
experiments using the homogeneous second order polynomial kernel 
$k(x_i, x_j) = (\gamma x_i^T x_j)^2$. Here, parameter $\gamma$ is estimated as 
the inverse of the average squared distance among training patterns. Parameter 
C is determined as usual by using a validation grid search on the values 
$2^0, 2^1, \ldots, 2^{12}$. The test set is never used to determine the SVM 
meta-parameters. 
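
As an illustration of the kernel and of the heuristic for its scale parameter, 
the sketch below computes the inverse of the average squared distance and 
evaluates the kernel on a toy set of points. It assumes that the average is 
taken over all distinct pairs of patterns, which the text does not specify. The 
function names and the three sample points are hypothetical. The example also 
shows why this kernel is not normalized: the self-similarity k(x, x) changes 
from point to point, whereas for a Gaussian kernel it is constant. 

```python
from itertools import combinations


def dot(u, v):
    """Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def inverse_mean_squared_distance(points):
    """gamma = 1 / (mean squared Euclidean distance over distinct pairs).

    Averaging over distinct pairs is an assumption made for this sketch.
    """
    pairs = list(combinations(points, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in pairs)
    return len(pairs) / total


def poly2_kernel(u, v, gamma):
    """Homogeneous second order polynomial kernel (gamma * <u, v>)^2."""
    return (gamma * dot(u, v)) ** 2


# Toy patterns, not taken from any dataset used in the paper.
X = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
gamma = inverse_mean_squared_distance(X)

# k(x, x) depends on the norm of x, so it is not constant over the data.
self_values = [poly2_kernel(x, x, gamma) for x in X]
print(gamma, self_values)
```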

Note however that the purpose of this section is not to determine an optimal 
choice of the kernel function on the considered problems. The results presented 
below are merely indicative of the capability of the FW and MFW methods to 
handle a wide family of kernels, thus allowing for a greater flexibility in building 
a classifier. 

Tab. 8 summarizes the results obtained on some of the datasets used in this 
section. We can see that both test accuracies and training times are comparable 
to those obtained using the Gaussian kernel. It should be noted that the CVM 
algorithm cannot be used to train an SVM using the kernel selected for this 
experiment, and thus we only include the FW and MFW methods in the 
table. These results demonstrate the capability of our methods to be used 
with kernels other than those satisfying the normalization condition imposed 
by CVMs. 

5.9 Discussion 

We now comment in more detail on the results presented above. First of all we 
note that, most of the time, the proposed algorithms appear very competitive 
against CVM, with a tendency to favor training speed in large datasets, 
sometimes at the expense of a little accuracy. CVMs appear faster than FW in 
just three problems among the single datasets studied in Subsection 5.5: 
Pendigits, USPS and Shuttle. It should be noted however that the Pendigits and 
USPS datasets correspond to multi-category problems and are approached using 
a decomposition method based on solving several binary subproblems. Now, 
as shown in Tab. 1, the largest binary subproblem for these datasets is smaller 
than all the problems of the Web collection and all but one of the Adult 
collection. When each subproblem is very small, SMO iterations are quite cheap, 
and the overall cost of running the BC procedure is reasonably low. In these 
cases, training with a CVM (or even with a traditional SMO-based SVM) 
possibly constitutes a convenient choice. The advantage of FW-based methods lies 
instead in their capability to effectively handle larger problems, as the 
results on the Web and Adult collections show. 

Dataset     Algorithm   Accuracy   STD-Accuracy   Time(s)    STD-Time
Shuttle     FW          96.58      9.44E-01       6.32E+01   3.69E+01
Shuttle     MFW         95.86      9.33E-01       8.52E+01   1.29E+02
Reuters     FW          95.80      3.00E-01       2.89E+00   3.23E-01
Reuters     MFW         95.90      1.98E-01       2.39E+00   1.52E-01
Web w1a     FW          97.22      1.08E-01       7.60E-02   3.50E-02
Web w1a     MFW         97.49      1.47E-01       2.52E-01   1.34E-01
Web w2a     FW          97.33      1.65E-01       1.42E-01   1.11E-01
Web w2a     MFW         97.09      1.93E-01       8.40E-02   6.53E-02
Web w3a     FW          [?]        [?]            [?]        [?]
Web w3a     MFW         97.22      1.22E-01       3.18E-01   3.19E-01
Web w4a     FW          97.16      1.43E-01       3.76E-01   3.22E-01
Web w4a     MFW         97.25      1.49E-01       3.74E-01   3.65E-01
Web w5a     FW          97.08      7.37E-02       1.54E-01   2.14E-01
Web w5a     MFW         97.11      1.12E-01       2.78E-01   3.20E-01
Web w6a     FW          97.28      2.37E-01       2.96E-01   2.37E-01
Web w6a     MFW         97.18      1.31E-01       5.02E-01   4.11E-01
Web w7a     FW          97.23      6.22E-02       3.88E-01   2.81E-01
Web w7a     MFW         97.23      1.51E-01       2.10E-01   1.35E-01
Web w8a     FW          97.06      1.10E-01       2.76E-01   2.32E-01
Web w8a     MFW         97.24      2.91E-01       3.38E-01   3.32E-01

Table 8: Accuracies and training times obtained with a polynomial kernel. 
Statistics correspond to the mean and standard deviation obtained from 5 
repetitions of each experiment. 

All the methods offer very similar testing performances on all the character 
recognition problems (Letter, Pendigits, MNIST, USPS-Scaled and 
USPS-Ext). On the IJCNN and Reuters datasets, CVM offers more accurate 
classifiers but requires a longer running time than FW and MFW. The 
same can be said about the KDD99-10pc problem, but in this case the speed-up 
offered by FW and MFW is considerably larger, up to two orders of magnitude. 
The Shuttle dataset returns mixed results, which are probably due to the 
marked lack of homogeneity in the size of the subproblems solved in the OVO 
decomposition approach. Finally, the two FW methods are clearly advantageous 
on the Protein and KDD99-Full datasets, where they offer the same accuracy as 
CVMs along with a considerably improved running time. 

The results on the Web and Adult datasets are of particular interest and 
deserve further comment. These collections consist of a series of datasets of 
increasing size, and from their study we expect to gain an understanding of the 
performance of the algorithms as m gradually increases. In fact, as documented 
in [30] and [25], these datasets have been commonly used to compare the 
scalability of SVM algorithms. In this regard, our results appear very 
encouraging. Not only do both FW algorithms outperform CVM in every instance of 
the Web collection with respect to CPU time, but the observed speed-up 
increases monotonically as the dataset size increases, reaching a peak of two 
orders of magnitude for the FW method. Both algorithms also achieve lower 
running times than CVM on all but two datasets of the Adult collection, with 
very similar testing accuracies. 
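
As a quick check of the orders of magnitude quoted above, the speed-up of FW 
over CVM can be recomputed as the ratio of the mean running times in Tab. 3. 
The short sketch below does this for three datasets whose entries are legible, 
under the assumption that the columns of Tab. 3 are ordered CVM, FW, MFW. The 
dictionary layout and variable names are illustrative only. 

```python
# Mean running times in seconds, as (CVM, FW) pairs read from Tab. 3.
# The column order CVM, FW, MFW is an assumption of this sketch.
mean_times = {
    "w7a": (215.0, 2.58),
    "w8a": (1030.0, 9.64),
    "KDD10pc": (42.3, 0.414),
}

# Speed-up of FW over CVM: how many times faster FW trains on average.
speedups = {name: cvm / fw for name, (cvm, fw) in mean_times.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

The ratios come out at roughly 83 for w7a, 107 for w8a and 102 for KDD10pc, 
in line with the two orders of magnitude mentioned in the text. 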

The clear advantage of the MFW method with respect to both FW and 
CVM in the Adult series can probably be explained by the considerable size of 
the support vector set, which roughly amounts to 60% of the full dataset for 
all the methods. It is evident that, if this set becomes large, SMO iterations 
become quite expensive, slowing down the CVM procedure. As regards the advantage 
of MFW over FW, we interpret the results as follows. At the beginning of 
the training process, the algorithms start with a small approximating ball, and 
progressively expand it by including new examples. Intuitively, in the first 
iterations both methods tend to include a large number of points in order to 
increase the radius of the ball (and thus the objective value). Some of these 
examples do not belong to the optimal support vector set and the algorithms 
will try to remove them from the model once they approach the solution. When 
the support vector set is large, as in this case, the number of these spurious 
examples can be quite large, hampering the progress towards the optimum. 
Now, FW is not endowed with the possibility of explicitly removing points 
from the current coreset approximation, implying that the weights of useless 
patterns vanish only in the limit. That is, a large number of iterations may be 
needed before they drop below the tolerance under which they are numerically 
considered zero. MFW, in contrast, possesses the ability to remove undesired 
points directly, and thus enjoys a considerable advantage when the number of 
such examples is not small. 
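
The difference between the two update rules can be made concrete with a small 
sketch. The code below is a generic Frank-Wolfe iteration for the dual MEB 
problem, with an optional away step, written for illustration only. It is not 
the implementation evaluated in this paper. It assumes the dual objective 
Phi(alpha) = sum_i alpha_i K_ii - alpha^T K alpha over the unit simplex, an 
exact line search, and a linear kernel on toy points. All function and variable 
names are hypothetical. In a toward step every existing weight is only rescaled 
by a factor 1 - lambda, so it stays positive unless lambda = 1. An away step 
that reaches its maximum step length sets the weight of the chosen point 
exactly to zero. 

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(K, v):
    return [dot(row, v) for row in K]


def meb_dual_value(K, alpha):
    """Phi(alpha) = sum_i alpha_i K_ii - alpha^T K alpha (squared radius estimate)."""
    return sum(a * K[i][i] for i, a in enumerate(alpha)) - dot(alpha, matvec(K, alpha))


def frank_wolfe_meb(K, n_iter=500, away_steps=False):
    """Maximize Phi over the simplex; away_steps=True enables point removal."""
    n = len(K)
    alpha = [0.0] * n
    alpha[0] = 1.0  # start from a ball containing only the first point
    for _ in range(n_iter):
        k_alpha = matvec(K, alpha)
        grad = [K[i][i] - 2.0 * k_alpha[i] for i in range(n)]
        # Toward step: move towards the point farthest from the current centre.
        i_far = max(range(n), key=lambda i: grad[i])
        direction = [-a for a in alpha]
        direction[i_far] += 1.0
        lam_max, dropped = 1.0, None
        if away_steps:
            # Away step: move away from the active point closest to the centre.
            active = [i for i in range(n) if alpha[i] > 0.0]
            j_near = min(active, key=lambda i: grad[i])
            away = list(alpha)
            away[j_near] -= 1.0
            if alpha[j_near] < 1.0 and dot(grad, away) > dot(grad, direction):
                direction = away
                lam_max = alpha[j_near] / (1.0 - alpha[j_near])
                dropped = j_near
        curvature = dot(direction, matvec(K, direction))
        if curvature <= 1e-15:
            break
        # Exact line search for the concave quadratic, clipped to the feasible range.
        lam = min(max(dot(grad, direction) / (2.0 * curvature), 0.0), lam_max)
        alpha = [a + lam * d for a, d in zip(alpha, direction)]
        if dropped is not None and lam == lam_max:
            alpha[dropped] = 0.0  # the away step hit its bound: exact removal
    return alpha


# Toy data: the corners of a square plus interior points that cannot be
# support vectors of the enclosing ball (optimal squared radius is 2).
X = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0),
     (0.5, 0.2), (-0.3, 0.4)]
K = [[dot(x, y) for y in X] for x in X]  # linear kernel

alpha_fw = frank_wolfe_meb(K, away_steps=False)
alpha_mfw = frank_wolfe_meb(K, away_steps=True)
print("FW :", meb_dual_value(K, alpha_fw), sum(a > 0.0 for a in alpha_fw))
print("MFW:", meb_dual_value(K, alpha_mfw), sum(a > 0.0 for a in alpha_mfw))
```

The run starts from the interior point at the origin, so both variants must get 
rid of its weight to reach the optimum. The printed counts of nonzero weights 
give a rough idea of how each variant handles such spurious points. 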

6 Conclusions and Perspectives 

In this paper we have described the application of ε-coreset based methods 
from computational geometry to the task of efficiently training an SVM, an 
idea first proposed in [37]. We have introduced two algorithms falling in this 
category, both based on the Frank-Wolfe optimization scheme. These methods 
use analytical formulas to learn the classifier from the training set and thus do 
not require the solution of nested optimization subproblems. Compared with 
the results we presented in [10], we have explored a variant of the algorithm 
which compares favorably in terms of testing accuracy and achieves training 
times similar to our original version. 

The large set of experiments we report in this paper confirms and considerably 
expands the conclusions reached in [10]. As long as a minor loss in accuracy 
is acceptable, both Frank-Wolfe based methods are able to build SVM 
classifiers in considerably less time than CVMs, which in turn have 
been shown in [37] to be faster than most traditional SVM software. These 
conclusions were statistically assessed using non-parametric tests. A second 
contribution of this work has been to present preliminary evidence about the 
capability to handle a wider family of kernels than CVMs, thus allowing for a 
greater flexibility in building a classifier. Further variations of this procedure 
will be explored in future work, including learning tasks other than 
classification. 